Preface



The 7th International Conference on Implementation and Application of Automata
(CIAA 2002) was held at the Université François Rabelais of Tours, in
Tours, France, on July 3-5, 2002.

This volume of Lecture Notes in Computer Science contains all the papers 
that were presented at CIAA 2002, as well as the abstracts of the poster papers 
that were displayed during the conference. 

The conference addressed issues in automata application and implementa- 
tion. The topics of the papers presented in this conference ranged from automata 
applications in software engineering, natural language and speech recognition, 
and image processing, to new representations and algorithms for efficient imple- 
mentation of automata and related structures. 

Automata theory is one of the oldest areas in computer science. Research in 
automata theory has always been motivated by its applications since its early 
stage of development. In the 1960s and 1970s, automata research was motivated 
heavily by problems arising from compiler construction, circuit design, string 
matching, etc. In recent years, many new applications of automata have been 
found in various areas of computer science as well as in other disciplines. Ex- 
amples of the new applications include statecharts in object-oriented modeling, 
finite transducers in natural language processing, and nondeterministic finite- 
state models in communication protocols. Many of the new applications cannot 
simply utilize the existing models and algorithms in automata theory in the so- 
lution to their problems. New models, or modifications of the existing models, 
are needed to satisfy their requirements. Also, the sizes of the typical problems 
in many of the new applications are astronomically larger than those used in the 
traditional applications. New algorithms and new representations of automata 
are in demand to reduce the time and space requirements of the computation. 

The CIAA conference series provided a forum for these new problems and new 
challenges. In these conferences, both theoretical and practical results related to 
the application and implementation of automata were presented and discussed, 
and software packages and toolkits were demonstrated. The participants of the 
conference series were from both research institutions and industry. 

We wish to thank all the program committee members and referees for their 
efforts in refereeing and selecting papers. 

We wish to thank EATCS and ACM SIGACT for their sponsorship, and the 
following institutions for their donations to the conference: 

- CNRS, 

- the Scientific Council of the University of Tours, 

- the UFR de Sciences et Techniques of the University of Rouen, 

- the Administrative Council Region Centre, 

- the Administrative Council Département d'Indre-et-Loire,

- the Municipality of Tours,

- the Xerox Research Centre Europe (Grenoble, France). 

We also thank the editors of the Lecture Notes in Computer Science series
and Springer-Verlag, in particular Ms. Anna Kramer, for their help in publishing
this volume.

Abstract. The edit-distance of two strings is the minimal cost of a
sequence of symbol insertions, deletions, or substitutions transforming 
one string into the other. The definition is used in various contexts 
to give a measure of the difference or similarity between two strings. 
This definition can be extended to measure the similarity between two
sets of strings. In particular, when these sets are represented by au- 
tomata, their edit-distance can be computed using the general algorithm 
of composition of weighted transducers combined with a single-source 
shortest-paths algorithm. More generally, in some applications such as 
speech recognition and computational biology, the strings may represent 
a range of alternative hypotheses with associated probabilities. Thus, 
we introduce the definition of the edit-distance of two distributions of 
strings given by two weighted automata. We show that general weighted 
automata algorithms over the appropriate semirings can be used to com- 
pute the edit-distance of two weighted automata exactly. The algorithm 
for computing exactly the edit-distance of weighted automata can be 
used to improve the word accuracy of automatic speech recognition 
systems. More generally, the algorithm can be extended to provide an 
edit-distance automaton useful for rescoring and other post-processing 
purposes in the context of large-vocabulary speech recognition. In the 
course of the presentation of our algorithm, we also introduce a new 
and general synchronization algorithm for weighted transducers which, 
combined with $\epsilon$-removal, can be used to normalize weighted transducers
with bounded delays. 



1 Introduction 

The edit-distance of two strings is the minimal cost of a sequence of symbol
insertions, deletions, or substitutions transforming one string into the other [?].
It is used in many applications such as computational biology to give a measure 
of the difference or similarity between two strings. 

There exists a classical dynamic-programming algorithm for computing the
edit-distance of two strings $x_1$ and $x_2$ in time $O(|x_1||x_2|)$, where $|x_i|$,
$i = 1, 2$, denotes the length of the string $x_i$ [?]. The algorithm is a special
instance of a single-source shortest-paths algorithm applied to a directed graph
expanded dynamically. Ukkonen improved that algorithm by reducing the size of the
directed graph that needs to be expanded [33]. The complexity of his algorithm is
$O(d \max\{|x_1|, |x_2|\})$, where $d$ is the edit-distance of $x_1$ and $x_2$. The
algorithm is more efficient for strings such that the distance $d$ is small with
respect to $|x_1|$ and $|x_2|$.^1
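
As an illustration of the classical dynamic program mentioned above, the following is a minimal Python sketch for the unit-cost case (every insertion, deletion, or substitution costs 1). The function name `edit_distance` is an illustrative choice, not something defined in the paper, and the sketch does not include Ukkonen's refinement.

```python
def edit_distance(x1: str, x2: str) -> int:
    """Unit-cost edit distance via the classical O(|x1||x2|) dynamic program."""
    # prev[j] holds the distance between the current prefix of x1 and x2[:j].
    prev = list(range(len(x2) + 1))          # distance from "" to each prefix of x2
    for i, a in enumerate(x1, start=1):
        cur = [i]                            # distance from x1[:i] to ""
        for j, b in enumerate(x2, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute, or match at cost 0
        prev = cur
    return prev[-1]
```

Only two rows of the table are kept at a time, so the sketch uses space linear in the length of the second string while keeping the quadratic running time.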

The definition of edit-distance can be extended to measure the similarity 
between two sets of strings. It is then the minimum of the edit-distance between 
any two strings, one in each set. In particular, when these sets are represented by 
(unweighted) automata, their edit-distance can be computed using the general 
algorithm of composition of weighted transducers combined with a single-source 
shortest-paths algorithm [?]. We briefly present that algorithm and point out
its generality for dealing with more complex edit-distance models including for 
example transpositions that can be represented by weighted transducers. The 
classical dynamic programming algorithm for strings can be viewed in fact as a 
special instance of that algorithm. 

In several applications such as speech recognition, computational biology, 
handwriting recognition, or topic spotting, the strings may represent a range of 
alternative hypotheses with associated probabilities given as a weighted automa- 
ton. Thus, we introduce the definition of the edit-distance of two distributions of 
strings given by weighted automata, and give a general algorithm for computing 
that edit-distance. 

Note that the number of hypotheses compactly represented by such weighted 
automata can be very large in many applications. In speech recognition applica- 
tions for example, even relatively small automata may contain more than four 
billion distinct strings. Thus, the use of the classical edit-distance computation 
for strings seems to be prohibitive here: the number of pairs of strings can be 
larger than four billion squared. The storage and use of the results would also be 
an issue. Therefore, it is crucial to keep the compact automata representation 
of the input strings, and provide an algorithm for computing the edit-distance 
that takes advantage of that representation. 

The edit-distance of weighted automata can be used to improve the word 
accuracy of automatic speech recognition systems. This was first demonstrated
by [32] by reducing automata to their N best strings, later by [?] who gave an
algorithm for computing an approximation of the edit-distance and an approximate
edit-distance automaton. An alternative approach was described by [?]
who used an A* heuristic search of deterministic machines and various pruning 
strategies, some based on the time segmentation of automata, to compute that 
edit-distance in the context of speech recognition. However, that approach does 
not produce an edit-distance automaton. 

We show that general weighted automata algorithms over the appropriate 
semirings can be used to compute the edit-distance of two weighted automata 
exactly. More generally, our algorithm can be used to provide an exact edit-distance
automaton useful for rescoring and other post-processing purposes in
the context of large-vocabulary speech recognition. The algorithm is general and
can be used in many other contexts such as computational biology. 



^1 We refer the reader to [?] for general surveys of edit-distance and other text
processing algorithms.

The paper is organized as follows. We first introduce some preliminary definitions
and notation related to semirings and automata that will be used in the rest
of the paper (section 2). In section 3, we introduce various definitions related 
to edit-distances, in particular the definition of the edit-distance of weighted 
automata. We then give a brief overview of several weighted automata algo- 
rithms used to compute the edit-distance of weighted automata (section 4). 
They include composition and $\epsilon$-removal of weighted transducers, determinization
of weighted automata, and a new and general synchronization algorithm for
weighted transducers which, combined with $\epsilon$-removal, can be used to normalize
weighted transducers with bounded delays. Section 5 presents in detail a general 
algorithm for computing the edit-distance of two unweighted automata. Finally, 
the algorithm for computing the edit-distance of weighted automata, the proof 
of its correctness, and the construction of the edit-distance weighted automaton 
are given in section 6. 

2 Preliminaries 

A weighted automaton is a finite automaton in which each transition is labeled 
with some weight element of a semiring in addition to the usual alphabet symbol. 

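Definition 1 ([16]). A system $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ is a semiring if:

1. $(\mathbb{K}, \oplus, \bar{0})$ is a commutative monoid with identity element $\bar{0}$;

2. $(\mathbb{K}, \otimes, \bar{1})$ is a monoid with identity element $\bar{1}$;

3. $\otimes$ distributes over $\oplus$;

4. $\bar{0}$ is an annihilator for $\otimes$: for all $a \in \mathbb{K}$, $a \otimes \bar{0} = \bar{0} \otimes a = \bar{0}$.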

Thus, a semiring is a ring that may lack negation. Some familiar examples are the
Boolean semiring $\mathcal{B} = (\{0, 1\}, \vee, \wedge, 0, 1)$, or the real semiring
$\mathcal{R} = (\mathbb{R}_+, +, \times, 0, 1)$ used to combine probabilities. Two
semirings particularly used in the following sections are: the log semiring
$\mathcal{L} = (\mathbb{R} \cup \{\infty\}, \oplus_{\log}, +, \infty, 0)$ [22], which is
isomorphic to $\mathcal{R}$ via a log morphism with:

$$\forall a, b \in \mathbb{R} \cup \{\infty\}, \quad a \oplus_{\log} b = -\log(\exp(-a) + \exp(-b))$$

where by convention $\exp(-\infty) = 0$ and $-\log(0) = \infty$, and the tropical
semiring $\mathcal{T} = (\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$, which is
derived from the log semiring using the Viterbi approximation.
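
To make the two semirings concrete, here is a small Python sketch of their operations. The `Semiring` container and the names `LOG`, `TROPICAL`, and `log_plus` are illustrative assumptions rather than notation from the paper; `log_plus` evaluates $-\log(\exp(-a) + \exp(-b))$ in a numerically stable form.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]   # the "sum" operation
    times: Callable[[float, float], float]  # the "product" operation
    zero: float                             # identity of plus, annihilator of times
    one: float                              # identity of times


def log_plus(a: float, b: float) -> float:
    """a (+)_log b = -log(exp(-a) + exp(-b)), with exp(-inf) = 0 by convention."""
    if a == math.inf:
        return b
    if b == math.inf:
        return a
    # Factor out the smaller argument so the exponential cannot overflow.
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))


LOG = Semiring(plus=log_plus, times=lambda a, b: a + b, zero=math.inf, one=0.0)
TROPICAL = Semiring(plus=min, times=lambda a, b: a + b, zero=math.inf, one=0.0)
```

Replacing `log_plus` by `min` is exactly the Viterbi approximation: the sum of two probabilities is approximated by the larger one, that is, by the smaller negative log.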

A semiring $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ is said to be weakly left
divisible if for any $x$ and $y$ in $\mathbb{K}$ such that $x \oplus y \neq \bar{0}$, there
exists at least one $z$ such that $x = (x \oplus y) \otimes z$. We can then write:
$z = (x \oplus y)^{-1} x$. Furthermore, we will assume then that $z$ can be found in a
consistent way, that is: $((u \otimes x) \oplus (u \otimes y))^{-1} (u \otimes x) = (x \oplus y)^{-1} x$
for any $x, y, u \in \mathbb{K}$ such that $u \neq \bar{0}$. A semiring is zero-sum-free if
for any $x$ and $y$ in $\mathbb{K}$, $x \oplus y = \bar{0}$ implies $x = y = \bar{0}$. Note that
the tropical semiring and the log semiring are weakly left divisible since the
multiplicative operation, $+$, admits an inverse.

Definition 2. A weighted finite-state transducer $T$ over a semiring $\mathbb{K}$ is an
8-tuple $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$ where:

- $\Sigma$ is the finite input alphabet of the transducer;

- $\Delta$ is the finite output alphabet;

- $Q$ is a finite set of states;

- $I \subseteq Q$ the set of initial states;

- $F \subseteq Q$ the set of final states;

- $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times \mathbb{K} \times Q$ a finite set of transitions;

- $\lambda : I \to \mathbb{K}$ the initial weight function; and

- $\rho : F \to \mathbb{K}$ the final weight function mapping $F$ to $\mathbb{K}$.

A weighted automaton $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ is defined in a similar way by
simply omitting the output labels.

Given a transition $e \in E$, we denote by $i[e]$ its input label, $p[e]$ its origin or
previous state and $n[e]$ its destination state or next state, $w[e]$ its weight, $o[e]$
its output label (transducer case). Given a state $q \in Q$, we denote by $E[q]$ the
set of transitions leaving $q$.

A path $\pi = e_1 \cdots e_k$ is an element of $E^*$ with consecutive transitions:
$n[e_{i-1}] = p[e_i]$, $i = 2, \ldots, k$. We extend $n$ and $p$ to paths by setting:
$n[\pi] = n[e_k]$ and $p[\pi] = p[e_1]$. A cycle $\pi$ is a path whose origin and
destination states coincide: $n[\pi] = p[\pi]$. We denote by $P(q, q')$ the set of paths
from $q$ to $q'$ and by $P(q, x, q')$ and $P(q, x, y, q')$ the set of paths from $q$ to $q'$
with input label $x \in \Sigma^*$ and output label $y$ (transducer case). These definitions
can be extended to subsets $R, R' \subseteq Q$, by:
$P(R, x, R') = \bigcup_{q \in R,\, q' \in R'} P(q, x, q')$. The labeling functions $i$ (and
similarly $o$) and the weight function $w$ can also be extended to paths by defining
the label of a path as the concatenation of the labels of its constituent transitions,
and the weight of a path as the $\otimes$-product of the weights of its constituent
transitions: $i[\pi] = i[e_1] \cdots i[e_k]$, $w[\pi] = w[e_1] \otimes \cdots \otimes w[e_k]$.
We also extend $w$ to any finite set of paths $\Pi$ by setting:
$w[\Pi] = \bigoplus_{\pi \in \Pi} w[\pi]$. An automaton $A$ is regulated if the output
weight associated by $A$ to each input string $x \in \Sigma^*$:

$$[\![A]\!](x) = \bigoplus_{\pi \in P(I, x, F)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi])$$

is well-defined and in $\mathbb{K}$. This condition is always satisfied when $A$ contains no
$\epsilon$-cycle since the sum then runs over a finite number of paths. It is also always
satisfied with k-closed semirings such as the tropical semiring [22]. $[\![A]\!](x)$ is
defined to be $\bar{0}$ when $P(I, x, F) = \emptyset$.

Similarly, a transducer $T$ is regulated if the output weight associated by $T$ to
any pair of input-output strings $(x, y)$ by:

$$[\![T]\!](x, y) = \bigoplus_{\pi \in P(I, x, y, F)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi])$$

is well-defined and in $\mathbb{K}$. $[\![T]\!](x, y) = \bar{0}$ when $P(I, x, y, F) = \emptyset$. In the
following, we will assume that all the automata and transducers considered are regulated.

A successful path in a weighted automaton or transducer $M$ is a path from an
initial state to a final state. A state $q$ of $M$ is accessible if $q$ can be reached from
$I$. It is coaccessible if a final state can be reached from $q$. A weighted automaton
$M$ is trim if there is no transition with weight $\bar{0}$ in $M$ and if all states of $M$ are
both accessible and coaccessible. $M$ is unambiguous if for any string $x \in \Sigma^*$ there
is at most one successful path labeled with $x$. Thus, an unambiguous transducer
defines a function.

Note that the second operation of the tropical semiring and the log semiring 
as well as their identity elements are identical. Thus the weight of a path in 
an automaton A over the tropical semiring does not change if A is viewed as a 
weighted automaton over the log semiring or vice-versa. 



3 Edit-Distance of Weighted Automata: Definitions 

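Let $\Sigma$ be a finite alphabet, and let $\Omega$ be defined by
$\Omega = (\Sigma \cup \{\epsilon\}) \times (\Sigma \cup \{\epsilon\}) - \{(\epsilon, \epsilon)\}$.
An element $\omega$ of the free monoid $\Omega^*$ can be viewed as one of $\Sigma^* \times \Sigma^*$
via the concatenation: $\omega = (a_1, b_1) \cdots (a_n, b_n) \to (a_1 \cdots a_n, b_1 \cdots b_n)$.
We will denote by $h$ the corresponding morphism from $\Omega^*$ to $\Sigma^* \times \Sigma^*$ and
write $h(\omega) = (a_1 \cdots a_n, b_1 \cdots b_n)$.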

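Definition 3. An alignment $\omega$ of two strings $x$ and $y$ over the alphabet $\Sigma$ is an
element of $\Omega^*$ such that $h(\omega) = (x, y)$.

As an example, $(a, \epsilon)(b, \epsilon)(a, b)(\epsilon, b)$ is an alignment of $(aba, bb)$:

    $x = a \; b \; a \; \epsilon$
    $y = \epsilon \; \epsilon \; b \; b$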

Let $c : \Omega \to \mathbb{R}_+$ be a function associating some non-negative cost to each
element of $\Omega$, that is to each symbol edit operation. For example $c((\epsilon, a))$ can
be viewed as the cost of the insertion of the symbol $a$. Define the cost of
$\omega \in \Omega^*$ as the sum of the costs of its constituents,
$\omega = \omega_0 \cdots \omega_n \in \Omega^*$:^2

$$c(\omega) = \sum_{i=0}^{n} c(\omega_i)$$



Definition 4. The edit-distance $d(x, y)$ of two strings $x$ and $y$ over the alphabet
$\Sigma$ is the minimal cost of a sequence of symbol insertions, deletions, or substitutions
transforming one string into the other:

$$d(x, y) = \min\{c(\omega) : h(\omega) = (x, y)\}$$

^2 We are not dealing here with the question of how such weights or costs could be
defined. In general, they can be derived from a corpus of alignments using various
machine learning techniques such as for example in [28].

In the classical definition of edit-distance, the cost of all edit operations (insertions,
deletions, substitutions) is one [18]:

$$\forall a, b \in \Sigma \cup \{\epsilon\}, \quad c((a, b)) = 1 \text{ if } a \neq b, \; 0 \text{ otherwise}$$

We will denote by $c_1$ this specific edit cost function. The definition of edit-distance
can be generalized to measure the similarity of two sets of strings $X$
and $Y$ by:

$$d(X, Y) = \inf\{d(x, y) : x \in X, y \in Y\}$$

In some applications such as speech recognition or computational biology, one
might wish to measure the similarity of a string $x$ with respect to a distribution
$Y$ of strings $y$ with probability $P(y)$. The edit-distance of $x$ to $Y$ can then be
defined by the expected edit-distance of $x$ to the strings $y$:

$$d(x, Y) = E_{P(y)}[d(x, y)]$$

Similarly, the edit-distance of two distributions $X$ and $Y$ is defined by:

$$d(X, Y) = E_{P(x, y)}[d(x, y)]$$

We are particularly interested here in the case where these distributions are given
by weighted automata, which is typical in the applications already mentioned.
More precisely, the corresponding automata are acyclic weighted automata over
the log semiring.^3 Let $A_1$ and $A_2$ be two acyclic weighted automata over the log
semiring $\mathcal{L}$ defined over the same alphabet $\Sigma$; their edit-distance is thus given
by:

$$\begin{aligned}
d(A_1, A_2) &= \sum_{x, y} \exp(-[\![A_1]\!](x) - [\![A_2]\!](y)) \, d(x, y) \\
            &= \sum_{x, y} \exp\{-([\![A_1]\!](x) + [\![A_2]\!](y) - \log d(x, y))\}
\end{aligned}$$

This equation can be rewritten using the operations of the log semiring.

Definition 5. The edit-distance of two acyclic weighted automata over the log
semiring $\mathcal{L}$, $A_1$ and $A_2$, is defined by:

$$d(A_1, A_2) = \exp\Big(-\underset{x, y}{\oplus_{\log}} \big([\![A_1]\!](x) + [\![A_2]\!](y) - \log d(x, y)\big)\Big)$$

We will give an algorithm for computing this edit-distance using general weighted
automata algorithms. Note that the computation of $d(A_1, A_2)$ is not trivial a
priori since its definition uses operations corresponding to both the tropical
semiring ($\min$ and $+$ for computing the edit-distance of two strings) and the log
semiring ($\oplus_{\log}$ and $+$).
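
The agreement between the expected-value form and the log-semiring form of Definition 5 can be checked numerically on tiny examples. The sketch below is only a brute-force illustration: it assumes each distribution is given explicitly as a dictionary from strings to negative log probabilities (rather than as an automaton), that the two distributions are independent, and that the string edit-distance is supplied as a function `dist`. All names are hypothetical.

```python
import math


def log_plus(a: float, b: float) -> float:
    """-log(exp(-a) + exp(-b)), with exp(-inf) = 0 by convention."""
    if a == math.inf:
        return b
    if b == math.inf:
        return a
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))


def expected_edit_distance(A1, A2, dist):
    """Direct form: sum over pairs of P(x) P(y) d(x, y)."""
    return sum(math.exp(-wx - wy) * dist(x, y)
               for x, wx in A1.items() for y, wy in A2.items())


def log_semiring_edit_distance(A1, A2, dist):
    """Form of Definition 5: exp(-(log-sum over pairs of w1 + w2 - log d))."""
    total = math.inf                      # zero element of the log semiring
    for x, wx in A1.items():
        for y, wy in A2.items():
            d = dist(x, y)
            # -log(0) = +inf by convention, so pairs at distance 0 contribute nothing.
            term = math.inf if d == 0 else wx + wy - math.log(d)
            total = log_plus(total, term)
    return math.exp(-total)
```

This enumeration is exactly what becomes prohibitive when the automata encode billions of strings, which is why the paper works on the automata directly.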

^3 For numerical stability, $-\log$ probabilities are used in practice rather than
probabilities.

4 Weighted Automata Algorithms

In this section we give a brief overview of some classical and existing weighted au- 
tomata algorithms such as composition, determinization, and minimization, and 
describe a new and general synchronization algorithm for weighted transducers. 

4.1 Composition of Weighted Transducers 

Composition is a fundamental operation on weighted transducers that can be 
used in many applications to create complex weighted transducers from simpler 
ones. Let $\mathbb{K}$ be a commutative semiring and let $T_1$ and $T_2$ be two weighted
transducers defined over $\mathbb{K}$ such that the input alphabet of $T_2$ coincides with
the output alphabet of $T_1$. Then, the composition of $T_1$ and $T_2$ is a weighted
transducer $T_1 \circ T_2$ defined for all $x, y$ by:^4

$$[\![T_1 \circ T_2]\!](x, y) = \bigoplus_{z} T_1(x, z) \otimes T_2(z, y)$$

There exists a general and efficient composition algorithm for weighted transducers
[27, 23]. States in the composition $T_1 \circ T_2$ of two weighted transducers $T_1$
and $T_2$ are identified with pairs of a state of $T_1$ and a state of $T_2$. Leaving aside
transitions with $\epsilon$ inputs or outputs, the following rule specifies how to compute
a transition of $T_1 \circ T_2$ from appropriate transitions of $T_1$ and $T_2$:^5

$$(q_1, a, b, w_1, q_2) \text{ and } (q'_1, b, c, w_2, q'_2) \Longrightarrow ((q_1, q'_1), a, c, w_1 \otimes w_2, (q_2, q'_2))$$

In the worst case, all transitions of $T_1$ leaving a state $q_1$ match all those of $T_2$
leaving state $q'_1$, thus the space and time complexity of composition is quadratic:
$O((|Q_1| + |E_1|)(|Q_2| + |E_2|))$. Figures 1(a)-(c) illustrate the algorithm when
applied to the transducers of figures 1(a)-(b) defined over the tropical semiring
$\mathbb{K} = \mathcal{T}$.

Intersection of weighted automata and composition of finite-state transducers
are both special cases of composition of weighted transducers. Intersection
corresponds to the case where input and output labels of transitions are identical
and composition of unweighted transducers is obtained simply by omitting the
weights. Thus, we can use both the notation $A = A_1 \cap A_2$ or $A_1 \circ A_2$ for the
intersection of two weighted automata $A_1$ and $A_2$. A string $x$ is recognized by
$A$ iff it is recognized by both $A_1$ and $A_2$ and $[\![A]\!](x) = [\![A_1]\!](x) \otimes [\![A_2]\!](x)$.
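
The transition-matching rule for composition can be sketched in a few lines of Python. This is a simplified illustration, not the general algorithm of [27, 23]: it assumes both transducers are free of $\epsilon$ labels, works over the tropical semiring (so the product of weights is `+`), and ignores initial weights. The dictionary layout and the name `compose` are assumptions made for the example.

```python
from collections import deque


def compose(T1, T2):
    """Epsilon-free composition over the tropical semiring.

    Each transducer is a dict with keys:
      "initial": set of states,
      "final":   dict state -> final weight,
      "arcs":    dict state -> list of (input, output, weight, next_state).
    """
    init = {(i1, i2) for i1 in T1["initial"] for i2 in T2["initial"]}
    arcs, final = {}, {}
    queue, seen = deque(init), set(init)
    while queue:
        state = queue.popleft()
        q1, q2 = state
        if q1 in T1["final"] and q2 in T2["final"]:
            final[state] = T1["final"][q1] + T2["final"][q2]
        out = arcs.setdefault(state, [])
        for a, b, w1, n1 in T1["arcs"].get(q1, []):
            for b2, c, w2, n2 in T2["arcs"].get(q2, []):
                if b == b2:                      # output of T1 matches input of T2
                    nxt = (n1, n2)
                    out.append((a, c, w1 + w2, nxt))
                    if nxt not in seen:          # only reachable pairs are expanded
                        seen.add(nxt)
                        queue.append(nxt)
    return {"initial": init, "final": final, "arcs": arcs}
```

Only pairs of states reachable from an initial pair are ever created, which is why the result is often much smaller than the quadratic worst case.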

4.2 Determinization of Weighted Automata 

A weighted automaton is said to be deterministic or subsequential [31] if it has
a unique initial state and if no two transitions leaving any state share the same
input label.

^4 Note that we use a matrix notation for the definition of composition as opposed to a
functional notation. This is a deliberate choice motivated in many cases by improved
readability.

^5 See [27, 23] for a detailed presentation of the algorithm including the use of a
transducer filter for dealing with $\epsilon$-multiplicity in the case of non-idempotent semirings.

Fig. 1. (a) Weighted transducer $T_1$ over the tropical semiring. (b) Weighted transducer
$T_2$ over the tropical semiring. (c) Construction of the result of composition $T_1 \circ T_2$.
Initial states are represented by bold circles, final states by double circles. Inside each
circle, the first number indicates the state number, the second, at final states only, the
value of the final weight function $\rho$ at that state. Arrows represent transitions and are
labeled with symbols followed by their corresponding weight.



There exists a natural extension of the classical subset construction to the
case of weighted automata over a weakly left divisible semiring called determinization
[20].^6 The algorithm is generic: it works with any weakly left divisible
semiring. Figures 2(a)-(b) illustrate the determinization of a weighted
automaton over the tropical semiring. A state $r$ of the output automaton that
can be reached from the start state by a path $\pi$ corresponds to the set of pairs
$(q, x) \in Q \times \mathbb{K}$ such that $q$ can be reached from an initial state of the original
machine by a path $\sigma$ with $i[\sigma] = i[\pi]$ and
$\lambda(p[\sigma]) \otimes w[\sigma] = \lambda(p[\pi]) \otimes w[\pi] \otimes x$.
Thus, $x$ is the remaining weight at state $q$.

Unlike the unweighted case, determinization does not halt for some input
weighted automata. In fact, some weighted automata, non subsequentiable automata,
do not even admit equivalent subsequential machines. We say that a
weighted automaton $A$ is determinizable if the determinization algorithm halts
for the input $A$. With a determinizable input, the algorithm outputs an equivalent
subsequential weighted automaton [?].

There exists a sufficient condition, necessary and sufficient for unambiguous
automata, for the determinizability of weighted automata over a tropical semiring
based on a twins property. There exists an efficient algorithm for testing the
twins property for weighted automata [1]. In particular, any acyclic weighted
automaton has the twins property and is determinizable.
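
The weighted subset construction can be illustrated for the tropical semiring, where the residual $x$ attached to each state is obtained by subtracting the weight emitted on the new transition. The following Python sketch is a simplified illustration, not the general algorithm of [20]: it assumes an $\epsilon$-free, determinizable input (for instance an acyclic automaton) with weights that compare exactly, such as integers, and the function name and data layout are assumptions.

```python
from collections import deque


def determinize(initial, final, arcs):
    """Weighted determinization over the tropical semiring (min, +).

    initial, final: dicts state -> weight.
    arcs:           dict state -> list of (label, weight, next_state).
    Returns (start_subset, start_weight, det_arcs, det_final), where each
    subset is a frozenset of (state, residual_weight) pairs and
    det_arcs[subset][label] = (weight, next_subset).
    """
    w0 = min(initial.values())
    start = frozenset((q, w - w0) for q, w in initial.items())
    det_arcs, det_final = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        subset = queue.popleft()
        finals = [x + final[q] for q, x in subset if q in final]
        if finals:
            det_final[subset] = min(finals)
        by_label = {}                      # label -> {next_state: best total weight}
        for q, x in subset:
            for a, w, n in arcs.get(q, []):
                best = by_label.setdefault(a, {})
                best[n] = min(best.get(n, float("inf")), x + w)
        det_arcs[subset] = {}
        for a, best in by_label.items():
            wa = min(best.values())        # weight emitted on the new transition
            nxt = frozenset((n, v - wa) for n, v in best.items())  # residuals
            det_arcs[subset][a] = (wa, nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, w0, det_arcs, det_final
```

On an input that is not determinizable, the set of residual weights keeps growing and the loop does not terminate, which mirrors the behavior described above.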



^6 We assume that the weighted automata considered are all such that for any string
$x \in \Sigma^*$, $w[P(I, x, Q)] \neq \bar{0}$. This condition is always satisfied with trim machines
over the tropical semiring or any zero-sum-free semiring.

Fig. 2. (a) Weighted automaton $A$ over the tropical semiring. (b) Equivalent
subsequential weighted automaton $A_2$ over the tropical semiring constructed by the
determinization algorithm.



4.3 Synchronization 

In this section, we present a general algorithm for the synchronization of weighted 
transducers. Roughly speaking, the objective of the algorithm is to synchronize 
the consumption of non-$\epsilon$ symbols by the input and output tapes of a transducer
as much as possible. 

A synchronization procedure was first presented by [?]. The first explicit
synchronization algorithm was given by [?] for transducers with bounded delay,
later extended by [2] to transducers with constant emission rate. The algorithm
of [?] applies only to transducers with input labels different from the empty
string $\epsilon$. We present a simple synchronization algorithm that is not restricted to
these transducers, and that applies more generally to weighted transducers with 
bounded delays defined over a semiring K. 

Definition 6. The delay of a path $\pi$ is defined as the difference of length between
its output and input labels:

$$d[\pi] = |o[\pi]| - |i[\pi]|$$

The delay of a path is thus simply the sum of the delays of its constituent
transitions. A trim transducer $T$ is said to have bounded delays if the delay
along all paths of $T$ is bounded. We then denote by $d[T] \geq 0$ the maximum delay
in absolute value of a path in $T$. The following lemma gives a straightforward
characterization of transducers with bounded delays.

Lemma 1. A transducer $T$ has bounded delays iff the delay of any cycle in $T$
is zero.

Proof. If $T$ admits a cycle $\pi$ with non-zero delay, then $d[T] \geq |d[\pi^n]| = n|d[\pi]|$ is
not bounded. Conversely, if all cycles have zero delay, then the maximum delay
in $T$ is that of the simple paths which are of finite number. □

We define the string delay of a path $\pi$ as the string $s[\pi]$ defined by:

    $s[\pi]$ = suffix of $o[\pi]$ of length $|d[\pi]|$ if $d[\pi] > 0$
    $s[\pi]$ = suffix of $i[\pi]$ of length $|d[\pi]|$ otherwise

and for any state $q \in Q$, the string delay at $q$ by the set of string delays of
the paths from an initial state to $q$:

$$s[q] = \{s[\pi] : \pi \in P(I, q)\}$$
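
The two notions just defined are easy to compute for an explicit path. In the short Python sketch below, a path is assumed to be given as a list of `(input_label, output_label)` pairs with the empty string standing for $\epsilon$; this representation and the function names are illustrative choices only.

```python
def delay(path):
    """d[path] = |output label| - |input label|."""
    return sum(len(o) for _, o in path) - sum(len(i) for i, _ in path)


def string_delay(path):
    """Suffix of the output label (positive delay) or of the input label otherwise."""
    inp = "".join(i for i, _ in path)
    out = "".join(o for _, o in path)
    d = len(out) - len(inp)
    if d > 0:
        return out[len(out) - d:]
    return inp[len(inp) + d:]        # d <= 0: suffix of length |d|, empty when d == 0
```

For example, a path with labels `("a", "b"), ("", "c"), ("", "d")` has delay 2 and string delay `"cd"`: the output tape is two symbols ahead of the input tape.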

Lemma 2. If $T$ has bounded delays then the set $s[q]$ is finite for any $q \in Q$.

Proof. The lemma follows immediately from the fact that the elements of $s[q]$ are all
of length at most $d[T]$. □

A weighted transducer T is said to be synchronized if along any successful 
path of T the delay is zero or varies strictly monotonically. An algorithm that 
takes as input a transducer T and computes an equivalent synchronized trans- 
ducer T' is called a synchronization algorithm. We present a synchronization 
algorithm that applies to all weighted transducers with bounded delays. The 
following is the pseudocode of the algorithm. 



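Synchronization(T)
 1  $F' \leftarrow Q' \leftarrow E' \leftarrow \emptyset$
 2  $S \leftarrow I' \leftarrow \{(i, \epsilon, \epsilon) : i \in I\}$
 3  while $S \neq \emptyset$
 4     do $p' = (q, x, y) \leftarrow head(S)$; Dequeue($S$)
 5        if ($q \in F$ and $|x| + |y| = 0$)
 6           then $F' \leftarrow F' \cup \{p'\}$; $\rho'(p') \leftarrow \rho(q)$
 7        else if ($q \in F$ and $|x| + |y| > 0$)
 8           then $q' \leftarrow (f, cdr(x), cdr(y))$
 9                $E' \leftarrow E' \cup \{(p', car(x), car(y), \rho[q], q')\}$
10                if ($q' \notin Q'$)
11                   then $Q' \leftarrow Q' \cup \{q'\}$; Enqueue($S$, $q'$)
12        for each $e \in E[q]$
13           do if ($|x\,i[e]| > 0$ and $|y\,o[e]| > 0$)
14                 then $q' \leftarrow (n[e], cdr(x\,i[e]), cdr(y\,o[e]))$
15                      $E' \leftarrow E' \cup \{(p', car(x\,i[e]), car(y\,o[e]), w[e], q')\}$
16                 else $q' \leftarrow (n[e], x\,i[e], y\,o[e])$
17                      $E' \leftarrow E' \cup \{(p', \epsilon, \epsilon, w[e], q')\}$
18              if ($q' \notin Q'$)
19                 then $Q' \leftarrow Q' \cup \{q'\}$; Enqueue($S$, $q'$)
20  return $T'$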



To simplify the presentation of the algorithm, we introduce a new state $f$
and add it to $Q$ and $F$ with $\rho[f] = \bar{1}$ and $E[f] = \emptyset$. We denote by $car(x)$ the
first symbol of a string $x$ if $x$ is not empty, $\epsilon$ otherwise, and denote by $cdr(x)$
the suffix of $x$ such that $x = car(x)\,cdr(x)$.
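
The pseudocode can be transcribed fairly directly into Python. The sketch below is an illustrative rendering under several assumptions that are not in the paper: labels are strings of length at most one with `""` standing for $\epsilon$, transitions are stored as `(input, output, weight, next_state)` tuples per state, initial weights are ignored, the extra final state $f$ is named `"f"` (which must not clash with an existing state name), and its final weight `one` defaults to `0.0`, the identity of the tropical and log semirings.

```python
from collections import deque

F_STATE = "f"          # hypothetical name for the extra final state f


def car(s):            # first symbol of s, or "" (epsilon) if s is empty
    return s[:1]


def cdr(s):            # remaining suffix, so that s == car(s) + cdr(s)
    return s[1:]


def synchronize(initial, final, arcs, one=0.0):
    """Synchronize a weighted transducer with bounded delays.

    initial: iterable of states; final: dict state -> final weight;
    arcs: dict state -> list of (input, output, weight, next_state).
    Returns (start_states, new_final, new_arcs); states of the result are
    triples (q, x, y) and new_arcs is a list of
    (origin, input, output, weight, destination) tuples.
    """
    final = dict(final)
    final[F_STATE] = one                       # rho[f] = one and E[f] is empty
    start = [(i, "", "") for i in initial]
    states, queue = set(start), deque(start)
    new_final, new_arcs = {}, []

    def visit(state):                          # lines 10-11 and 18-19
        if state not in states:
            states.add(state)
            queue.append(state)

    while queue:                               # lines 3-19
        p = (q, x, y) = queue.popleft()
        if q in final:
            if not x and not y:                # lines 5-6
                new_final[p] = final[q]
            else:                              # lines 7-11: flush leftover symbols
                nxt = (F_STATE, cdr(x), cdr(y))
                new_arcs.append((p, car(x), car(y), final[q], nxt))
                visit(nxt)
        for a, b, w, n in arcs.get(q, []):     # lines 12-19
            xi, yo = x + a, y + b
            if xi and yo:                      # emit one symbol on each tape
                nxt = (n, cdr(xi), cdr(yo))
                new_arcs.append((p, car(xi), car(yo), w, nxt))
            else:                              # postpone: store the strings in the state
                nxt = (n, xi, yo)
                new_arcs.append((p, "", "", w, nxt))
            visit(nxt)
    return start, new_final, new_arcs
```

Unlike line 1 of the pseudocode, the sketch records the start states as already visited, so they are not enqueued a second time if they are reached again. The resulting $\epsilon$:$\epsilon$ transitions are left in place; removing them is the separate $\epsilon$-removal step mentioned in the text.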

Fig. 3. (a) Weighted automaton $A_1$ over the tropical semiring. (b) Equivalent
synchronized weighted automaton $A_2$. (c) Synchronized weighted automaton $A_3$
equivalent to $A_1$ and $A_2$ obtained by $\epsilon$-removal from $A_2$.



Each state of the resulting transducer $T'$ corresponds to a triplet $(q, x, y)$
where $q \in Q$ is a state of the original machine $T$ and where $x \in \Sigma^*$ and
$y \in \Delta^*$ are strings over the input and output alphabet of $T$.

The algorithm maintains a queue $S$ that contains at any time the set of
states of $T'$ to examine. Each time through the loop of the lines 3-19, a new
state $p' = (q, x, y)$ is extracted from $S$ (line 4) and its outgoing transitions are
computed and added to $E'$. The state $p'$ is final iff $q$ is final and $x = y = \epsilon$, and
in that case the final weight at $p'$ is simply the final weight at the original state
$q$ (lines 5-6). If $q$ is final but the strings $x$ and $y$ are not both empty, then the
algorithm constructs a sequence of transitions from $p'$ to $(f, \epsilon, \epsilon)$ to consume the
remaining input and output strings $x$ and $y$ (lines 7-11).

For each transition $e$ of $q$ an outgoing transition $e'$ is created for $p'$ with
weight $w[e]$. The input and output labels of $e'$ are either both $\epsilon$ if $x\,i[e]$ or $y\,o[e]$
is the empty string, the first symbol of these strings otherwise, and the remaining
suffixes of these strings are stored in the destination state $q'$ (lines 12-19). Note
that in all cases, the transitions created by the steps of the algorithm described in
lines 14-17 have zero delay. The state $q'$ is inserted in $S$ if it has never been found
before (lines 18-19). Figures 3(a)-(b) illustrate the synchronization algorithm just
presented.

Lemma 3. Let $(q, x, y)$ correspond to a state of $T'$ created by the algorithm.
Then, either $x = \epsilon$ or $y = \epsilon$.

Proof. Let $p' = (q, x, y)$ be a state extracted from $S$. It is not hard to verify that
if $x = \epsilon$ or $y = \epsilon$, then the destination state of a transition leaving $p'$ created by
the algorithm is of the form $q' = (r, \epsilon, y')$ or $q' = (r, x', \epsilon)$. Since the algorithm
starts with the states $(i, \epsilon, \epsilon)$, $i \in I$, by induction, for any state $p' = (q, x, y)$
created either $x = \epsilon$ or $y = \epsilon$. □

Lemma 4. Let $\pi'$ be a path in $T'$ created by the synchronization algorithm such
that $n[\pi']$ corresponds to $(q', x', y')$ with $q' \neq f$. Then, the delay of $\pi'$ is zero.

Proof. By construction, the delay of each transition created at lines 14-17 is
zero. Since the delay of a path is the sum of the delays of its transitions, this
proves the lemma. □



Lemma 5. Let $(q', x', y')$ correspond to a state of $T'$ created by the algorithm
with $q' \neq f$. Then, either $x' \in s[q']$ or $y' \in s[q']$.

Proof. By induction on the length of $\pi'$, it is easy to prove that there is a path
$\pi'$ from state $(q, x, y)$ to $(q', x', y')$ iff there is a path from $q$ to $q'$ with input label
$x^{-1} i[\pi'] x'$, output label $y^{-1} o[\pi'] y'$, and weight $w[\pi']$.

Thus, the algorithm constructs a path $\pi'$ in $T'$ from $(i, \epsilon, \epsilon)$, $i \in I$, to
$(q', x', y')$, $q' \neq f$, iff there exists a path $\pi$ in $T$ from $i$ to $q'$ with input label
$i[\pi] = i[\pi'] x'$, output label $o[\pi] = o[\pi'] y'$ and weight $w[\pi] = w[\pi']$. By lemma 4,
$|i[\pi']| = |o[\pi']|$. Thus, if $x' = \epsilon$, $y'$ is the string delay of $\pi$. Similarly, if $y' = \epsilon$, $x'$ is
the string delay of $\pi$. By lemma 3, $x' = \epsilon$ or $y' = \epsilon$, thus $y' \in s[q']$ or $x' \in s[q']$. □



Theorem 1. The synchronization algorithm presented terminates with any input
weighted transducer $T$ with bounded delays and produces an equivalent
synchronized transducer $T'$.

Proof. By lemmas 3 and 5, if $(q', x', y')$ is a state created by the algorithm with
$q' \neq f$, then either $x' = \epsilon$ and $y' \in s[q']$ or $y' = \epsilon$ and $x' \in s[q']$. If $T$ has bounded
delays, by lemma 2, $s[q']$ is finite, thus the algorithm produces a finite number of
states of the form $(q', x', y')$ with $q' \neq f$.

Let $(q, x, \epsilon)$ be a state created by the algorithm with $q \in F$ and $|x| > 0$.
$x = x_1 \cdots x_n$ is thus a string delay at $q$. The algorithm constructs a path from
$(q, x, \epsilon)$ to $(f, \epsilon, \epsilon)$ with intermediate states $(f, x_i \cdots x_n, \epsilon)$. Since string delays are
bounded, at most a finite number of such states are created by the algorithm.
A similar result holds for states $(q, \epsilon, y)$ with $q \in F$ and $|y| > 0$. Thus, the
algorithm produces a finite number of states and terminates if $T$ has bounded
delays.

By lemma 4, paths $\pi'$ in $T'$ with destination state $(q, x, y)$ with $q \neq f$ have
zero delay and the delay of a path from a state $(f, x, y)$ to $(f, \epsilon, \epsilon)$ is strictly
monotonic. Thus, the output of the algorithm is a synchronized transducer.
This ends the proof of the theorem. □

The algorithm creates a distinct state (q, x, ε) or (q, ε, y) for each string delay
x, y ∈ s[q] at state q ≠ f. The paths from a state (q, x, ε) or (q, ε, y), q ∈ F, to
(f, ε, ε) are of length |x| or |y|. The length of a string delay is bounded by d[T].
Thus, there are at most O(|Σ|^{d[T]} + |Δ|^{d[T]}) distinct string delays at each
state. Thus, in the worst case, the size of the resulting transducer T' is:

$$O\big((|Q| + |E|)(|\Sigma|^{d[T]} + |\Delta|^{d[T]})\big)$$

The string delays can be represented in a compact and efficient way using a suffix
tree. Indeed, let U be a tree representing all the input and output labels of the
paths in T found in a depth-first search of T. The size of U is linear in that of T
and a suffix tree V of U can be built in time proportional to the number of nodes
of U times the size of the alphabet [15], that is in O((|Σ| + |Δ|) · (|Q| + |E|)).
Since each string delay x is a suffix of a string represented by U, it can be
represented by two nodes n₁ and n₂ of V and a position in the string labeling
the edge from n₁ to n₂. The operations performed by the algorithm to construct
a new transition require either computing xa or a⁻¹x where a is a symbol of the
input or output alphabet. Clearly, these operations can be performed in constant
time: xa is obtained by going down one position in the suffix tree, and a⁻¹x by
using the suffix link at node n₁. Thus, using this representation, the operations
performed for the construction of each new transition can be done in constant
time. This includes the cost of comparison of a newly created state (q', x', ε) with
an existing state (q, x, ε), since the comparison of the string delays x and x' can
be done in constant time. Thus, the worst case space and time complexity of the
algorithm is:

$$O\big((|Q| + |E|)(|\Sigma|^{d[T]} + |\Delta|^{d[T]})\big)$$

This is not a tight evaluation of the complexity since it is not clear if the worst 
case previously described can ever occur, but the algorithm can indeed produce 
an exponentially larger transducer in some cases. 

Note that the algorithm does not depend on the queue discipline used for S 
and that the construction of the transitions leaving a state p' = (q, x, y) of T'
only depends on p' and not on the states and transitions previously constructed. 
Thus, the transitions of T' can be naturally computed on-demand. We have 
precisely given an on-the-fly implementation of the algorithm and incorporated 
it in a general-purpose finite-state machine library (FSM library) |24l2bj . Note 
also that the additive and multiplicative operations of the semiring are not used
in the definition of the algorithm. Only 1, the identity element of ⊗, was used
for the definition of the final weight of f. Thus, to a large extent, the algorithm
is independent of the semiring 𝕂. In particular, the behavior of the algorithm
is identical for two semirings having the same identity elements, such as for
example the tropical and log semirings.



4.4 ε-Removal

The result of the synchronization algorithm may contain ε-transitions (transitions
with both input and output empty string) even if the input contains none.
An equivalent weighted transducer with no ε-transitions can be computed from
T' using a general ε-removal algorithm [21]. Figure 3(c) illustrates the result of
that algorithm when applied to the synchronized transducer of figure 3(b).

Since ε-removal does not shift input and output labels with respect to each
other, the result of its application to T' is also a synchronized transducer.

Note that the synchronization algorithm does not produce any ε-cycle if the
original machine T does not contain any. Thus, in that case, the computation of
the ε-closures in T' can be done in linear time [21] and the total time complexity
of ε-removal is O(|Q'|² + (T_⊕ + T_⊗)|Q'| · |E'|), where T_⊕ and T_⊗ denote the cost
of the ⊕, ⊗ operations in the semiring 𝕂. Also, on-the-fly synchronization can be
combined with on-the-fly ε-removal to directly create synchronized transducers
with no ε-transition on-the-fly.

A by-product of the application of synchronization followed by e-removal is 
that the resulting transducer is normalized. 

Definition 7. Let π and π' be two paths of a transducer T with the same input
and output labels: i[π] = i[π'] and o[π] = o[π']. We will say that π = e₁ ⋯ eₙ and
π' = e'₁ ⋯ e'ₙ' are identical if they have the same number of transitions (n = n')
with the same labels: i[eₖ] = i[e'ₖ] and o[eₖ] = o[e'ₖ] for k = 1, ..., n. T is said to
be normalized if any two paths π and π' with the same input and output labels
are identical.

Note that the definition does not require the weights of two identical paths to
be the same.

Lemma 6. Let T be a synchronized transducer and assume that T has no
ε-transition. Then, T is normalized.

Proof. Let π and π' be two paths with the same input and output labels. Since
T is synchronized and has no ε-transitions, π and π' have the same delay and,
more precisely, the delay varies in the same way along these two paths, which are
thus identical. □

5 Computation of the Edit-Distance of Unweighted 
Automata 

The edit-distance d(X, Y) of two sets of strings X and Y each represented by
an unweighted automaton can be computed using the general algorithm of
composition of transducers and a single-source shortest-paths algorithm. The
algorithm applies similarly in the case of an arbitrarily complex edit-distance
defined by a weighted transducer over the tropical semiring.

Let A₁ and A₂ be two (unweighted) automata representing the sets X and
Y. By definition, the edit-distance of X and Y, or equivalently that of A₁ and
A₂, is defined by:

$$d(A_1, A_2) = \inf \{ d(x, y) : x \in Dom(A_1),\ y \in Dom(A_2) \}$$



5.1 Alignment Costs in the Tropical Semiring

Let Ψ be the formal power series defined over the alphabet Ω and the tropical
semiring by: (Ψ, (a, b)) = c((a, b)) for (a, b) ∈ Ω.

Lemma 7. Let ω = (a₀, b₀) ⋯ (aₙ, bₙ) ∈ Ω* be an alignment, then (Ψ*, ω) is
exactly the cost of the alignment ω.

Fig. 4. Weighted transducer over the tropical semiring representing Ψ* with the edit
cost function c₁ and Σ = {a, b}.



Proof. By definition of the +-multiplication of power series in the tropical
semiring:

$$\begin{aligned}
(\Psi^*, \omega) &= \min_{u_0 \cdots u_k = \omega} (\Psi, u_0) + \cdots + (\Psi, u_k) \\
&= (\Psi, (a_0, b_0)) + \cdots + (\Psi, (a_n, b_n)) \\
&= \sum_{i=0}^{n} c((a_i, b_i)) = c(\omega)
\end{aligned}$$

This proves the lemma. □
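As a small worked instance of lemma 7, assume that the edit cost function c₁ assigns cost 0 to a match and cost 1 to any insertion, deletion or substitution (an assumption made here for illustration; c₁ is defined earlier in the paper), and take the alignment ω = (a, a)(b, ε)(a, b):

$$(\Psi^*, \omega) = c_1((a, a)) + c_1((b, \epsilon)) + c_1((a, b)) = 0 + 1 + 1 = 2$$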

Ψ* is a rational power series as the closure of the polynomial power series Ψ.
Thus, by the theorem of Schützenberger [30], there exists a weighted automaton
A defined over the alphabet Ω and the semiring 𝒯 realizing Ψ*. A can also be
viewed as a weighted transducer T with input and output alphabets Σ. Figure
4 shows the simple finite-state transducer T realizing Ψ* in the particular case
of the edit cost function c₁ and with Σ = {a, b}.

5.2 Algorithm 

By definition of composition of transducers and by lemma 7, the weighted
transducer A₁ ∘ T ∘ A₂ contains a successful path corresponding to each alignment ω
of a string accepted by A₁ and a string accepted by A₂ and the weight of that
path is c(ω).

Theorem 2. Let U be the weighted transducer over the tropical semiring
obtained by: U = A₁ ∘ T ∘ A₂. Let π be a shortest path of U from the initial state
to the final states. Then, π is labeled with one of the best alignments of a string
accepted by A₁ and a string accepted by A₂ and: d(A₁, A₂) = w[π].

Proof. The result follows directly from the previous remark. □

The theorem provides an algorithm for computing the best alignment between
the strings of two unweighted automata A₁ and A₂ and for computing their
edit-distance d(A₁, A₂). Any single-source shortest-paths algorithm applied to U can
be used to compute the edit-distance and a best alignment. Let |V| denote the
sum of the number of states and transitions of an automaton or transducer V.
Since the worst case time and space complexity of composition is quadratic, we
have: |U| = O(|A₁||A₂|).

When U is acyclic, which is the case in particular when A₁ and A₂ are both
acyclic, the total time complexity of the computation of the best alignment and
the edit-distance d(A₁, A₂) is O(|A₁||A₂|) since we can then use Lawler's
linear-time single-source shortest paths algorithm. In the general case, the total
complexity of the algorithm is O(|E| + |Q| log |Q|), where E denotes the set of
transitions and Q the set of states of U, using Dijkstra's algorithm implemented
with Fibonacci heaps.

In particular, the time complexity of the computation of the edit-distance for
two strings x and y is O(|x||y|). The classical dynamic programming algorithm
for computing the edit-distance of two strings can in fact be viewed as a special
instance of the more general algorithm just presented.
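The construction of section 5.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the FSM library implementation: it assumes the one-state edit-distance transducer with a cost function `cost(a, b)`, builds the states of A₁ ∘ T ∘ A₂ on the fly as pairs of states, and runs Dijkstra's algorithm with a binary heap. The dictionary layout and the names `string_automaton`, `levenshtein_cost` and `edit_distance_automata` are hypothetical.

```python
import heapq
from itertools import count

EPS = ""  # label used for the empty string (a gap in the alignment)

def levenshtein_cost(a, b):
    """Assumed edit cost: 0 for a match, 1 for insertion, deletion, substitution."""
    return 0 if a == b else 1

def string_automaton(s):
    """Linear automaton accepting exactly the string s."""
    return {"initial": 0, "finals": {len(s)},
            "arcs": {i: [(ch, i + 1)] for i, ch in enumerate(s)}}

def edit_distance_automata(a1, a2, cost=levenshtein_cost):
    """Shortest-path weight in the implicit product A1 o T o A2 (non-negative costs)."""
    tie = count()  # tie-breaker so the heap never compares states directly
    start = (a1["initial"], a2["initial"])
    dist = {start: 0}
    heap = [(0, next(tie), start)]
    while heap:
        d, _, (p1, p2) = heapq.heappop(heap)
        if d > dist[(p1, p2)]:
            continue  # stale heap entry
        if p1 in a1["finals"] and p2 in a2["finals"]:
            return d  # first final pair popped is at minimal distance
        arcs1 = a1["arcs"].get(p1, [])
        arcs2 = a2["arcs"].get(p2, [])
        moves = [(cost(a, EPS), (q1, p2)) for a, q1 in arcs1]          # deletions
        moves += [(cost(EPS, b), (p1, q2)) for b, q2 in arcs2]         # insertions
        moves += [(cost(a, b), (q1, q2))
                  for a, q1 in arcs1 for b, q2 in arcs2]               # matches, substitutions
        for c, nxt in moves:
            if d + c < dist.get(nxt, float("inf")):
                dist[nxt] = d + c
                heapq.heappush(heap, (d + c, next(tie), nxt))
    return float("inf")  # one of the two languages is empty

# Two strings are a special case: both automata are linear.
print(edit_distance_automata(string_automaton("kitten"), string_automaton("sitting")))
```

With two linear automata the explored pairs of states are exactly the cells of the classical dynamic programming table, which is the sense in which that algorithm is a special instance of the general one.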

This algorithm is very general. It extends to the case of automata the classical
edit-distance computation and it also generalizes the classical definition of
edit-distance. Indeed, any weighted transducer with non-negative weights can be used
here without modifying the algorithm.* Edit-distance transducers with arbitrary
topologies, arbitrary number of states and transitions can be used instead of
the specific one-state edit-distance transducer used in most applications. More
general transducers assigning non-negative costs to transpositions or to more
general weighted context-dependent rules [25] can be used to model complex
edit-distances. 

6 Computation of the Edit-Distance of Weighted 
Automata 

Our algorithm for computing the edit-distance of two weighted automata is based
on weighted composition, determinization, ε-removal, and synchronization. More
precisely, our algorithm computes −log of that edit-distance:

$$-\log(d(A_1, A_2)) = \mathop{\oplus_{\log}}_{x, y} [\![A_1]\!](x) + [\![A_2]\!](y) - \log d(x, y)$$

We first show that the cost of the alignment of two strings can be computed
using a simple weighted transducer over the log semiring.



* Transducers with negative weights can be used as well, but the Bellman-Ford
single-source shortest-paths algorithm would then need to be used in general,
which can make the algorithm less efficient.

Fig. 5. Weighted transducer over the log semiring representing S with the edit cost
function c₁ and Σ = {a, b}. The transition weights and the final weight at state 1 are
all equal to 0, since −log(c₁(x, y)) = 0 for x ≠ y.



6.1 Alignment Costs in the Log Semiring

Let Ψ be the formal power series defined over the alphabet Ω and the log semiring
by: (Ψ, (a, b)) = −log(c((a, b))) for (a, b) ∈ Ω, and let S be the formal power
series over the log semiring defined by:

$$S = \Omega^* + \Psi + \Omega^*$$

S is a rational power series as a +-product and closure of the polynomial power
series Ω and Ψ. Thus, by the theorem of Schützenberger [30], there exists a
weighted automaton A defined over the alphabet Ω and the semiring ℒ realizing
S. A can also be viewed as a weighted transducer T with input and output
alphabets Σ. Figure 5 shows the simple finite-state transducer T realizing S in
the particular case of the edit cost function c₁ and with Σ = {a, b}.

Lemma 8. Let ω = (a₀, b₀) ⋯ (aₙ, bₙ) ∈ Ω* be an alignment, then (S, ω) is
equal to −log of the cost of ω.

Proof. By definition of the +-multiplication of power series in the log semiring:

$$\begin{aligned}
(S, \omega) &= \mathop{\oplus_{\log}}_{u\,(a_i, b_i)\,v = \omega} (\Omega^*, u) + (\Psi, (a_i, b_i)) + (\Omega^*, v) \\
&= \mathop{\oplus_{\log}}_{i=0}^{n} -\log(c((a_i, b_i))) = -\log\Big(\sum_{i=0}^{n} c((a_i, b_i))\Big) = -\log c(\omega)
\end{aligned}$$

This proves the lemma. □

The lemma is a special instance of a more general property that can be easily
proved in the same way: given an alphabet Σ and a rational set X ⊆ Σ*, the
power series Σ* + X + Σ* over the log semiring is rational and associates to each
string x ∈ Σ* −log of the number of occurrences of an element of X in x.
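The counting property can be checked numerically with a short Python sketch. It assumes, for simplicity, that X is a finite set given explicitly (the property itself holds for any rational X), and it enumerates the factorizations x = u p v by brute force rather than through a weighted automaton. The names `log_add` and `occurrences_weight` are hypothetical.

```python
import math

def log_add(a, b):
    """The log-semiring sum: -log(exp(-a) + exp(-b)); math.inf is its zero element."""
    if a == math.inf:
        return b
    if b == math.inf:
        return a
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))

def occurrences_weight(x, patterns):
    """Weight given to x by the series Sigma* + X + Sigma* for a finite set X."""
    total = math.inf
    for i in range(len(x) + 1):
        for p in patterns:
            if x.startswith(p, i):
                # Each factorization x = u p v has weight 0 + 0 + 0 in the log semiring.
                total = log_add(total, 0.0)
    return total

# "ab" occurs twice in "abab", so the two values printed should agree.
print(occurrences_weight("abab", {"ab"}), -math.log(2))
```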
6.2 Algorithm

Let A₁ and A₂ be two acyclic weighted automata defined over the alphabet
Σ and the log semiring ℒ, and let T be the weighted transducer over the log
semiring associated to S.

Let M = A₁ ∘ T ∘ A₂. M can be viewed as a weighted automaton over the
alphabet Ω.

Lemma 9. Let ω ∈ Ω* be an alignment such that h(ω) = (x, y) with x ∈
Dom(A₁) and y ∈ Dom(A₂), then:

$$[\![M]\!](\omega) = -\log c(\omega) + [\![A_1]\!](x) + [\![A_2]\!](y)$$

Proof. By definition of composition, ⟦M⟧(ω) represents the value associated by
S to the alignment ω with weight W = ⟦A₁⟧(x) + ⟦A₂⟧(y). By lemma 8 and the
definition of power series, S associates to an alignment ω with weight W the
following: −log c(ω) + W. □

The automaton M may contain several paths labeled with the same alignment
ω. M is acyclic as the result of the composition of T with acyclic automata,
thus it can be determinized. Denote by det_ℒ(M) the result of that determinization
in the log semiring. By definition of determinization, det_ℒ(M) is equivalent
to M but contains exactly one path for each alignment ω between two strings
x ∈ Dom(A₁) and y ∈ Dom(A₂).

We need to keep for each pair of strings x and y only one path, the one
corresponding to the alignment ω of x and y with the minimal cost c(ω) or
equivalently maximal ⟦M⟧(ω). We will use determinization in the tropical semiring,
det_𝒯, to do so. However, to apply this algorithm we first need to ensure that the
transducer is normalized so that paths corresponding to different alignments ω
but with the same h(ω) be merged by the automata determinization det_𝒯. By
lemma 6, one way to normalize the automaton consists of using the synchronization
algorithm, synch, followed by ε-removal in the log semiring, rmε_ℒ.

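Theorem 3. Let N be the deterministic weighted automaton defined by:

$$N = -\mathrm{det}_{\mathcal{T}}(-\mathrm{rm}\epsilon_{\mathcal{L}}(\mathrm{synch}(\mathrm{det}_{\mathcal{L}}(A_1 \circ T \circ A_2))))$$

Then for any x ∈ Dom(A₁) and y ∈ Dom(A₂):

$$[\![N]\!](x, y) = [\![A_1]\!](x) + [\![A_2]\!](y) - \log d(x, y)$$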

Proof. Let ω ∈ Ω* be an alignment such that h(ω) = (x, y) with x ∈ Dom(A₁)
and y ∈ Dom(A₂), then, by lemma 9:

$$[\![\mathrm{rm}\epsilon_{\mathcal{L}}(\mathrm{synch}(\mathrm{det}_{\mathcal{L}}(A_1 \circ T \circ A_2)))]\!](\omega) = [\![A_1]\!](x) + [\![A_2]\!](y) - \log c(\omega)$$

Since rmε_ℒ(synch(det_ℒ(A₁ ∘ T ∘ A₂))) is normalized, by definition of determinization
in the tropical semiring, for any x ∈ Dom(A₁) and y ∈ Dom(A₂):

$$\begin{aligned}
[\![N]\!](x, y) &= \max_{h(\omega) = (x, y)} [\![A_1]\!](x) + [\![A_2]\!](y) - \log c(\omega) \\
&= [\![A_1]\!](x) + [\![A_2]\!](y) - \log d(x, y)
\end{aligned}$$

□

Since N is deterministic when viewed as a weighted automaton, the shortest
distance from the initial state i to the final states F in the log semiring is
exactly what we intended to compute:

$$\mathop{\oplus_{\log}}_{\pi \in P(i, F)} w[\pi] = \mathop{\oplus_{\log}}_{x, y} [\![A_1]\!](x) + [\![A_2]\!](y) - \log d(x, y) = -\log(d(A_1, A_2))$$

This shortest distance can be computed in linear time using a generalization of
the classical single-source shortest paths algorithm for acyclic graphs [22]. Thus,
the theorem shows that the edit-distance of two automata A₁ and A₂ can be
computed exactly using general weighted automata algorithms.

The worst case complexity of the algorithm is exponential but in practice
several techniques can be used to improve its efficiency. First, a heuristic pruning
can be used to reduce the size of the original automata A₁ and A₂ or that
of intermediate automata and transducers in the algorithm described.
Additionally, weighted minimization in the tropical and log semirings [20] can be
used to optimally reduce the size of the automata after each determinization.
Finally, the automaton A is not determinizable in the log semiring but it can
be approximated by a deterministic one, for example by limiting the number of
insertions, deletions or substitutions to some large but fixed number or by using
ε-determinization [20]. The advantage of a deterministic A is that it is
unambiguous and thus it leads to an unambiguous machine M in the sense that no
two paths of M correspond to the same alignment. Thus, it is not necessary to
apply determinization in the log semiring, det_ℒ, to M.



6.3 Edit-Distance Weighted Automaton 

In some applications such as speech recognition, one might wish to compute not
just the edit-distance of A₁ and A₂ but an automaton A₃ accepting exactly the
same strings as A₁ and such that the weight associated to x ∈ Dom(A₁) is −log
of the expected edit-distance of x to A₂: ⟦A₃⟧(x) = −log d(x, A₂). In such cases,
the automaton A₁ is typically assumed to be unweighted: ⟦A₁⟧(x) = 0 for all
x ∈ Dom(A₁).

More precisely, A₂ is then the weighted automaton, or word lattice, output
of the recognizer, and the weight of each sentence is −log of the probability of
that sentence given the acoustic information. However, the word-accuracy of a
speech recognizer is measured by computing the edit-distance of the sentence
output of the recognizer and the reference sentence. This motivates the
algorithm presented in this section. Assuming that all candidate sentences are
represented by some automaton A₁ (A₁ could represent all possible sentences
for example or just the sentences accepted by A₂), one wishes to determine for
each sentence in A₁ its expected edit-distance to A₂ and thus to compute A₃.

Let proj₁ be the operation that creates an acceptor from a weighted transducer
by removing its output labels. The following theorem gives the algorithm
for computing A₃ based on classical weighted automata algorithms.

Theorem 4. Let A₁ be an unweighted automaton and A₂ an acyclic weighted
automaton over the log semiring. Then the edit-distance automaton A₃ can be
computed as follows from N:

$$A_3 = \mathrm{det}_{\mathcal{L}}(\mathrm{proj}_1(N))$$

Proof. Since A₁ is unweighted, by theorem 3, for any x ∈ Dom(A₁) and y ∈
Dom(A₂):

$$[\![N]\!](x, y) = [\![A_2]\!](y) - \log d(x, y)$$

To construct A₃ we can omit the output labels of N. proj₁(N) may have several
paths labeled with the same input x. If we apply weighted determinization in
the log semiring to it, then by definition the weight of a path labeled with x will
be exactly:

$$\begin{aligned}
\mathop{\oplus_{\log}}_{y \in Dom(A_2)} [\![A_2]\!](y) - \log d(x, y)
&= -\log\Big[\sum_{y \in Dom(A_2)} \exp\big(-[\![A_2]\!](y) + \log d(x, y)\big)\Big] \\
&= -\log\Big[\sum_{y \in Dom(A_2)} \exp\big(-[\![A_2]\!](y)\big)\, d(x, y)\Big] \\
&= -\log d(x, A_2)
\end{aligned}$$

This proves the theorem. □
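The quantity computed by theorem 4 can be illustrated by brute force when A₂ is small enough to be listed explicitly. The sketch below is not the automata-based algorithm of the theorem: it assumes that the lattice is given as a Python dictionary mapping each sentence y to its weight ⟦A₂⟧(y) = −log P(y), and that d is the Levenshtein distance. The names `levenshtein` and `neg_log_expected_distance` are hypothetical.

```python
import math

def levenshtein(x, y):
    """Classical dynamic-programming edit-distance between two sequences."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # match or substitution
        prev = cur
    return prev[-1]

def neg_log_expected_distance(x, lattice):
    """-log d(x, A2), where lattice maps each y to its weight -log P(y)."""
    expected = sum(math.exp(-w) * levenshtein(x, y) for y, w in lattice.items())
    return -math.log(expected) if expected > 0 else math.inf

# Hypothetical two-sentence lattice with probabilities 0.5 and 0.5.
lattice = {"ab": -math.log(0.5), "ac": -math.log(0.5)}
print(neg_log_expected_distance("ab", lattice))
```

The `math.inf` branch corresponds to the case discussed in the text where the expected edit-distance is zero and the weight −log 0 is infinite.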

The weighted automaton A₃ can be further minimized using weighted minimization
to reduce its number of states and transitions [20].

In the log semiring, the weight associated to an alignment with cost zero is
∞ = −log 0. Thus, paths corresponding to the best alignments would simply
not appear in the result. To avoid this effect, one can assign an arbitrarily large
cost to perfect alignments.

In speech recognition, using a sentence with the lowest expected word error
rate instead of one with the highest probability can lead to a significant
improvement of the word accuracy of the system. That sentence is simply the
label of a shortest path in A₃ and can therefore be obtained from A₃ efficiently
using a classical single-source shortest-paths algorithm.

Speech recognition systems often use a rescoring method. This consists of
first using a simple acoustic and grammar model to produce a word lattice or
n-best list, and then reevaluating these alternative hypotheses with a more
sophisticated model or by using information sources of a different nature. The
weighted automaton or word lattice A₃ can be used advantageously for such
rescoring purposes.

7 Conclusion 

Algorithms for computing the edit-distance of unweighted and weighted
automata were given. These algorithms are based on general and efficient weighted
automata algorithms over different semirings and classical single-source
shortest-paths algorithms. They demonstrate the power of automata theory and semiring
theory and provide a complex example of the use of multiple semirings in a single
application.

The algorithms presented have applications in many areas ranging from text
processing to computational biology. They can lead to significant improvements
of the word accuracy in large-vocabulary speech recognition as shown by several
experiments [32, 19, 12]. Recently, several kernels were introduced in
computational biology for input vectors representing biological sequences [14, 35, 9]. These
string kernels are specific instances of the more general class of rational kernels
and can all be computed efficiently using the general algorithms presented in
sections 5 and 6. They can be generalized to deal properly with probabilistic or
weighted sequences. An algorithm for computing such generalized kernels based
on general weighted automata algorithms was given in section 6.



Acknowledgments. I thank Cyril Allauzen and Michael Riley for discussions
about this work.

Abstract. p-subsequential transducers are efficient finite-state transducers
with p final outputs used in a variety of applications. Not all
transducers admit equivalent p-subsequential transducers however. We
briefly describe an existing generalized determinization algorithm for
p-subsequential transducers and give the first characterization of
p-subsequentiable transducers, transducers that admit equivalent p-subsequential
transducers. Our characterization shows the existence of an efficient
algorithm for testing p-subsequentiability. We have fully implemented the
generalized determinization algorithm and the algorithm for testing
p-subsequentiability. We report experimental results showing that these
algorithms are practical in large-vocabulary speech recognition applications.
The theoretical formulation of our results is the equivalence of the
following three properties for finite-state transducers: determinizability
in the sense of the generalized algorithm, p-subsequentiability, and the
twins property.



1 Introduction 

Finite-state transducers are automata in which transitions are labeled with both
an input and an output symbol. Transducers have been used successfully to create
complex systems in many applications such as text and language processing,
speech recognition and image processing.

The time efficiency of such systems is substantially increased when subsequential
transducers [15], i.e. finite-state transducers with deterministic input,
are used. Subsequential machines can be generalized to p-subsequential
transducers, which are transducers with deterministic input with p (p ≥ 1) final
output strings. This generalization is necessary in many applications such
as language processing to account for finite ambiguities.

Not all transducers admit equivalent p-subsequential transducers however.
We present the first characterization of p-subsequentiable transducers, i.e.
transducers that admit equivalent p-subsequential transducers. Our characterization
is based on the twins property and leads to an efficient algorithm for testing
p-subsequentiability. More generally, our results show the equivalence of the
following three fundamental properties for finite-state transducers: determinizability
in the sense of a generalized algorithm, p-subsequentiability, and the twins
property.

This can also be viewed as a generalization of the results known in the case of
functional transducers: determinizable functional transducers are exactly those
that admit equivalent subsequential transducers. We generalize these results
by relaxing the condition on functionality: determinizable transducers are exactly
those that admit equivalent p-subsequential transducers and exactly those that
admit the twins property.

We have fully implemented the generalized determinization algorithm mentioned
above and the algorithm for testing p-subsequentiability. We report
experimental results showing that these algorithms are practical in large-vocabulary
speech recognition applications.

We first introduce the notation used in the rest of this paper, then briefly
describe a generalized determinization algorithm for p-subsequential transducers
introduced by [10], present a fundamental characterization theorem, and describe
our experimental results.

2 Preliminaries 

Definition 1. A finite-state transducer T = (Σ, Δ, Q, I, F, E, λ, ρ) is an 8-tuple
where Σ is a finite input alphabet, Δ a finite output alphabet, Q a finite set of
states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states, E ⊆
Q × Σ × (Δ ∪ {ε}) × Q a finite set of transitions, λ : I → Δ* the initial output
function mapping I to Δ*, and ρ : F → 2^{Δ*} the final output function mapping
each state q ∈ F to a finite subset of Δ*.

Given a transition e ∈ E, we denote by i[e] its input label, p[e] its origin
or previous state and n[e] its destination state or next state, o[e] its output
label. Given a state q ∈ Q, we denote by E[q] the set of transitions leaving
q. We extend the definitions of i, n, p, and E to sets in the following way:
i[∪_{k∈K} eₖ] = ∪_{k∈K} i[eₖ] and similarly for n, p, and E.

A path π = e₁ ⋯ eₖ in T is an element of E* with consecutive transitions:
n[eᵢ₋₁] = p[eᵢ], i = 2, ..., k. We extend n and p to paths by setting: n[π] = n[eₖ]
and p[π] = p[e₁]. We denote by P(q, q') the set of paths from q to q' and by
P(q, x, q') the set of paths from q to q' with input label x ∈ Σ*. These definitions
can be extended to subsets R, R' ⊆ Q, by: P(R, x, R') = ∪_{q∈R, q'∈R'} P(q, x, q').

The labeling functions i and o can also be extended to paths by defining the
label of a path as the concatenation of the labels of its constituent transitions:

$$i[\pi] = i[e_1] \cdots i[e_k] \qquad o[\pi] = o[e_1] \cdots o[e_k]$$

The set of output strings associated by a transducer T to an input string x ∈ Σ*
is defined by:

$$[\![T]\!](x) = \bigcup_{\pi \in P(I, x, F)} \lambda(p[\pi])\, o[\pi]\, \rho(n[\pi])$$

⟦T⟧(x) = ∅ when P(I, x, F) = ∅. The domain of definition of T is defined as:
Dom(T) = {x ∈ Σ* : ⟦T⟧(x) ≠ ∅}. A transducer is said to be p-functional for
some integer p if it associates at most p strings to each input string, that is if
|⟦T⟧(x)| ≤ p for any x ∈ Σ*. Two transducers T and T' are equivalent when
⟦T⟧ = ⟦T'⟧.

C. Allauzen and M. Mohri
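The definition of ⟦T⟧(x) can be made concrete with a small Python sketch. The class layout below is an assumption chosen for illustration (dictionaries for λ, ρ and the transitions), and the example transducer is hypothetical. The function `outputs` follows all paths reading x symbol by symbol, which is sufficient because transitions carry exactly one input symbol in definition 1.

```python
from dataclasses import dataclass, field

@dataclass
class Transducer:
    initial: dict  # lambda: initial state -> initial output string
    final: dict    # rho: final state -> finite set of final output strings
    arcs: dict = field(default_factory=dict)  # state -> [(input, output, next_state)]

def outputs(t, x):
    """Set of strings lambda(p[pi]) o[pi] rho(n[pi]) over successful paths pi reading x."""
    frontier = {(q, out) for q, out in t.initial.items()}
    for symbol in x:
        frontier = {(nxt, out + o)
                    for q, out in frontier
                    for a, o, nxt in t.arcs.get(q, [])
                    if a == symbol}
    return {out + r for q, out in frontier for r in t.final.get(q, ())}

def is_p_functional_on(t, p, inputs):
    """Check |[[T]](x)| <= p on a finite sample of inputs only (not a proof)."""
    return all(len(outputs(t, x)) <= p for x in inputs)

# Hypothetical example with two final output strings at state 2.
t = Transducer(initial={0: ""},
               final={1: {""}, 2: {"", "w"}},
               arcs={0: [("a", "x", 1), ("a", "y", 2)],
                     1: [("b", "", 1)],
                     2: [("b", "z", 2)]})
print(sorted(outputs(t, "ab")))
```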

A successful path in a transducer T is a path from an initial state to a final
state. A state q ∈ Q is accessible if q can be reached from I. It is coaccessible
if a final state can be reached from q. T is trim if all the states of T are both
accessible and coaccessible. T is unambiguous if for any string x ∈ Σ* there is
at most one successful path labeled with x. An unambiguous transducer is thus
p-functional, with p = max_{q∈F} |ρ(q)|.

A transducer T is said to be p-subsequential for some integer p if it has
a unique initial state, if no two transitions leaving the same state share the
same input label and if there are at most p final output strings at each final
state: |ρ(f)| ≤ p for all f ∈ F. T is said to be p-subsequentiable if there exists a
p-subsequential transducer T' equivalent to T.

Given two strings x and y in Σ*, we say that y is a suffix of x if there
exists z ∈ Σ* such that x = zy and similarly that y is a prefix of x if there
exists z such that x = yz. We denote by x ∧ y the longest common prefix of
x and y and denote by |x| the length of a string x ∈ Σ*. We extend Σ by
associating to each symbol a ∈ Σ a new symbol denoted by a⁻¹ and define
Σ⁻¹ as: Σ⁻¹ = {a⁻¹ : a ∈ Σ}. X = (Σ ∪ Σ⁻¹)* is then the set of strings written
over the alphabet (Σ ∪ Σ⁻¹). If we assume that aa⁻¹ = a⁻¹a = ε, then X forms
a group called the free group generated by Σ and is denoted by Σ^{(*)}. Note that
the inverse of a string x = a₁ ⋯ aₙ is then x⁻¹ = aₙ⁻¹ ⋯ a₁⁻¹. The formulas used
in our definitions, theorems and proofs should be interpreted as equations in the
free group generated by Σ.

3 General Determinization Algorithm with 
p-Subsequential Outputs 

In this section, we give a brief description of a general determinization
algorithm introduced by [10] that takes as input a transducer T and outputs a
p-subsequential transducer T' = (Σ, Δ, Q', {i'}, F', E', λ', ρ'). A transducer T for
which the algorithm terminates and thus generates an equivalent p-subsequential
transducer is said to be determinizable.

The algorithm is a generalization of the subset construction used in the
determinization of finite automata. A state in the output transducer T' is a set of
pairs (q, z) where q is a state of the input transducer T and z ∈ Σ* a remainder
output string with the following property: if a state q' in T' containing a pair
(q, z) can be reached from the initial state by a path with input x and output
y, then q can be reached in T from an initial state by a path with input x and
output yz.

The pseudocode of the algorithm is given below. Line 1 initializes the set of
states, final states, and transitions of T' to the empty set. The algorithm uses
a queue S containing the set of states of T' to be considered next. S initially
contains the unique initial state of T', i', which is the set of pairs of an initial
state i of T and the corresponding initial output string λ(i) (line 2).

Fig. 1. Generalized determinization of finite-state transducers. (a) Non-deterministic
transducer. (b) Construction of equivalent 2-subsequential transducer.



Transducer-Determinization(T)
 1  F' ← Q' ← E' ← ∅
 2  S ← i'
 3  while S ≠ ∅
 4     do p' ← head(S)
 5        Dequeue(S)
 6        for each x ∈ i[E[Q[p']]]
 7           do y' ← ⋀{zy : (p, z) ∈ p', (p, x, y, q) ∈ E}
 8              q' ← {(q, y'⁻¹zy) : (p, z) ∈ p', (p, x, y, q) ∈ E}
 9              E' ← E' ∪ {(p', x, y', q')}
10              if (q' ∉ Q')
11                 then Q' ← Q' ∪ {q'}
12                      if Q[q'] ∩ F ≠ ∅
13                         then F' ← F' ∪ {q'}
14                              ρ'(q') ← ∪{zρ(q) : (q, z) ∈ q', q ∈ F}
15                      Enqueue(S, q')
16  return T'



Each time through the loop of lines 3-15, a new subset p' (or equivalently a
new state of T') is extracted from S. The algorithm then creates (lines 6-9) a
transition with input label x ∈ Σ and output label y' ∈ Σ* leaving p' if there
exists at least one pair (p, z) ∈ p' such that p admits an outgoing transition
with input label x and output label y. y' is then defined as the longest common
prefix of all such zy's. The destination state q' of that transition is the subset
containing the pairs (q, y'⁻¹zy) such that (p, z) ∈ p' and (p, x, y, q) is a transition
in E. If the destination state q' is new, it is added to Q' (lines 10-11). q' is a
final state if it contains at least one pair (q, z), q being a final state. Its final set
of output strings is then the union of zρ(q) over all such pairs (q, z).

Fig. 2. Non-determinizable case. (a) A non-determinizable finite-state transducer;
states 1 and 2 are non-twin siblings. (b) Determinization does not halt in this case
and creates an infinite number of states.

There are input transducers that are not determinizable, that is for which 
the algorithm does not terminate. When it terminates, the output transducer 
T' is equivalent to T. Thus, it does not terminate with any transducer T that is 
not p-subsequentiable. 
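The pseudocode above can be turned into a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: transducers are given as plain dictionaries, subsets are frozensets of (state, remainder) pairs, and `max_states` is a hypothetical guard added here so that the function raises an error instead of looping forever on a non-determinizable input. Unlike the pseudocode, the sketch also applies the final-output rule of lines 12-14 to the initial subset.

```python
from collections import deque
from os.path import commonprefix  # character-wise longest common prefix of strings

def determinize(initial, final, arcs, max_states=1000):
    """initial: {state: lambda(state)}; final: {state: set of strings rho(state)};
    arcs: {state: [(input, output, next_state)]}.
    Returns (start, new_arcs, new_final) describing T'."""
    start = frozenset(initial.items())  # i' = {(i, lambda(i)) : i in I}
    new_arcs, new_final = {}, {}
    seen, queue = {start}, deque([start])

    def finalize(subset):
        # Lines 12-14: rho'(q') is the union of z rho(q) over pairs (q, z) with q final.
        outs = {z + r for q, z in subset for r in final.get(q, ())}
        if outs:
            new_final[subset] = outs

    finalize(start)
    while queue:
        p_prime = queue.popleft()
        by_input = {}
        for p, z in p_prime:
            for x, y, q in arcs.get(p, []):
                by_input.setdefault(x, []).append((q, z + y))
        for x, pairs in by_input.items():
            y_prime = commonprefix([zy for _, zy in pairs])                   # line 7
            q_prime = frozenset((q, zy[len(y_prime):]) for q, zy in pairs)    # line 8
            new_arcs.setdefault(p_prime, []).append((x, y_prime, q_prime))    # line 9
            if q_prime not in seen:                                           # lines 10-15
                if len(seen) >= max_states:
                    raise RuntimeError("state limit reached; input may not be determinizable")
                seen.add(q_prime)
                finalize(q_prime)
                queue.append(q_prime)
    return start, new_arcs, new_final

# Hypothetical input: two paths reading "ab" with outputs "xz" and "xy".
arcs = {0: [("a", "x", 1), ("a", "xy", 2)], 1: [("b", "z", 3)], 2: [("b", "", 3)]}
start, d_arcs, d_final = determinize({0: ""}, {3: {""}}, arcs)
print(d_arcs[start])
```

On this input the result has a single path reading "ab" and two final output strings, i.e. it is 2-subsequential. On an input whose remainders grow without bound the guard is triggered instead.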

Figure 1(b) illustrates the application of the algorithm to the transducer of
figure 1(a). Figures 2(a)-(b) show an example of non-determinizable transducer.

The worst case complexity of determinization is exponential. However, in 
many applications such as large-vocabulary speech recognition such a blow-up 
does not occur and determinization leads to a significant improvement of speed 
versus accuracy at a reasonable cost in space [12]. 

4 Characterization 

This section presents a characterization of p-subsequentiable transducers. The 
characterization is based on the following property. 

Definition 2. Let $T$ be a finite-state transducer. Two states $q_1$ and $q_2$ of $T$
are said to be siblings if there exist two strings $x$ and $y$ in $\Sigma^*$ such that
both $q_1$ and $q_2$ can be reached from $I$ by paths with input label $x$ and there
are cycles at $q_1$ and $q_2$ both with input label $y$. Two siblings $q_1$ and $q_2$ are
said to be twins if for any paths $\pi_1 \in P(I, x, q_1)$, $c_1 \in P(q_1, y, q_1)$,
$\pi_2 \in P(I, x, q_2)$, $c_2 \in P(q_2, y, q_2)$,

$o[\pi_1]^{-1}o[\pi_2] = o[\pi_1 c_1]^{-1}o[\pi_2 c_2]$ (1)

$T$ has the twins property if any two siblings in $T$ are twins.
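
As a small illustration of equation (1), the residual $u^{-1}v$ of two strings can be represented in reduced form by cancelling their longest common prefix. The sketch below checks equation (1) for one given choice of path outputs; the function names `residual` and `twins_equation_holds` are hypothetical, and this is not the testing algorithm of [1], which must consider all siblings and paths.

```python
from os.path import commonprefix


def residual(u, v):
    """Reduced form of u^{-1} v as a pair (rest of u, rest of v).

    After the longest common prefix is cancelled, the two remainders start
    with different letters (or one is empty), so equal pairs correspond to
    equal elements of the free group.
    """
    k = len(commonprefix([u, v]))
    return u[k:], v[k:]


def twins_equation_holds(o_pi1, o_c1, o_pi2, o_c2):
    """Check o[pi1]^{-1} o[pi2] == o[pi1 c1]^{-1} o[pi2 c2] for given outputs."""
    return residual(o_pi1, o_pi2) == residual(o_pi1 + o_c1, o_pi2 + o_c2)
```

For instance, path outputs `a` and `ab` followed by cycle outputs `ba` and `ab` satisfy the equation, while two empty path outputs followed by cycle outputs `a` and `b` do not: the delay between the two paths grows with every turn around the cycles.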

The twins property was originally introduced by [4,5] to give a characterization
of functional subsequentiable transducers. The decidability of the twins property
was also first proved by the same author (see also [3]). The first polynomial-time
algorithm for testing the twins property was given by [16]. This algorithm
was later improved by [2]. More recently, we gave a more efficient algorithm
for testing the twins property based on the general algorithm of composition
of finite-state transducers and a new characterization of the twins property in
terms of combinatorics of words [1].

The following factorization lemma will be useful in several proofs. 

Lemma 1. Let $T = (\Sigma, \Delta, Q, I, F, E, \rho)$ be a finite-state transducer, let
$\pi$ be a path from $I$ to state $p \in Q$ and $\pi'$ a path from $I$ to $p'$ with the
same input label $w = i[\pi] = i[\pi']$. Assume that $|w| > |Q|^2 - 1$, then there
exist paths $\pi_1, \pi_2, \pi_3, \pi'_1, \pi'_2, \pi'_3$ such that:

$\pi = \pi_1\pi_2\pi_3 \qquad \pi' = \pi'_1\pi'_2\pi'_3$

where $\pi_2$ and $\pi'_2$ are cycles with non-empty input labels and
$i[\pi_k] = i[\pi'_k]$ for $k = 1, 2, 3$.

Proof. Consider the transducer $U$ obtained by composing $T$ and $T^{-1}$:
$U = T \circ T^{-1}$. Since $\pi$ and $\pi'$ have the same input label, there exists a
path $\psi$ in $U$ with input $o[\pi]$ and output $o[\pi']$. Since
$|\psi| = |w| > |Q|^2 - 1$ and $U$ has at most $|Q|^2$ states, $\psi$ goes through at
least one non-empty cycle $\psi_2$: $\psi = \psi_1\psi_2\psi_3$. This shows the existence
of the common factoring for $\pi$ and $\pi'$ since $\psi$ results from matching $\pi$
and the path obtained from $\pi'$ by swapping its input and output labels. □

The following lemma will be used to prove that determinization terminates when 
the twins property holds. 

Lemma 2. Assume that $T$ has the twins property. Let $R$ be defined by:

$R = \{o[\pi']^{-1}o[\pi] : i[\pi] = i[\pi'] = w, |w| \leq |Q|^2\}$

Let $q_1$ and $q_2$ be two states of $T$, $\pi$ a path from $I$ to $q_1$, and $\pi'$ a path
from $I$ to $q_2$ with the same input label: $i[\pi] = i[\pi']$, then
$o[\pi']^{-1}o[\pi] \in R$.

Proof. Let $w$ be the common input label of $\pi$ and $\pi'$ and assume that
$|w| > |Q|^2$. By Lemma 1, paths $\pi$ and $\pi'$ can be factored in the following way:

$\pi = \pi_1\pi_2\pi_3 \qquad \pi' = \pi'_1\pi'_2\pi'_3$

where $\pi_2$ and $\pi'_2$ are cycles with non-empty input labels and
$i[\pi_k] = i[\pi'_k]$, for $k = 1, 2, 3$. Let $p = \pi_1\pi_3 \in P(I, q_1)$ and
$p' = \pi'_1\pi'_3 \in P(I, q_2)$ and $w' = i[p] = i[p']$. Since $T$ has the twins
property, $o[\pi'_1\pi'_2]^{-1}o[\pi_1\pi_2] = o[\pi'_1]^{-1}o[\pi_1]$. Thus:
$o[\pi'_1\pi'_2\pi'_3]^{-1}o[\pi_1\pi_2\pi_3] = o[\pi'_1\pi'_3]^{-1}o[\pi_1\pi_3] = o[p']^{-1}o[p]$.
Since $|i[\pi_2]| > 0$, $w'$ is a string strictly shorter than $w$. By induction, we
can find paths $p \in P(I, q_1)$ and $p' \in P(I, q_2)$, with $i[p] = i[p'] = w'$,
$|w'| \leq |Q|^2$ and such that $o[\pi']^{-1}o[\pi] = o[p']^{-1}o[p]$, thus
$o[\pi']^{-1}o[\pi] \in R$. This proves the lemma. □

The following two lemmas are used in the proof of our main result.

Lemma 3. Let $x_1, x_2, y_1, y_2 \in \Sigma^*$. Assume that for some integers $r > 0$ and
$s > 0$, the following holds:

$(x_1 y_1^r)^{-1} x_2 y_2^r = (x_1 y_1^{r+s})^{-1} x_2 y_2^{r+s}$ (3)

then:

$x_1^{-1}x_2 = (x_1 y_1)^{-1} x_2 y_2$ (4)

Proof. Let $x_1, x_2, y_1, y_2 \in \Sigma^*$ be strings satisfying the hypothesis of the
lemma. Without loss of generality, we can assume that $|x_2| \geq |x_1|$. Equality (3)
of the lemma can be rewritten as:
$y_1^{-r} x_1^{-1} x_2 y_2^r = y_1^{-(r+s)} x_1^{-1} x_2 y_2^{r+s}$, or:
$y_1^s (x_1^{-1}x_2) = (x_1^{-1}x_2) y_2^s$. Repeated applications of this identity
lead to:

$y_1^{sn}(x_1^{-1}x_2) = (x_1^{-1}x_2)y_2^{sn}$ (5)

for any $n \geq 1$. This implies that $x_1^{-1}x_2$ is a string and that it is a prefix
of $y_1^{sn}$. Thus, $y_1$ is a period of $x_1^{-1}x_2$ [14]. There exist an integer $p$,
and two strings $u$ and $v$ such that $y_1 = vu$ and $x_1^{-1}x_2 = y_1^p v$.
Re-injecting this in equation (5) gives $y_2 = uv$ and completes the proof of the
lemma. □
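
A concrete instance may help in following the proof. The sketch below picks strings of the shape obtained at the end of the proof ($y_1 = vu$, $y_2 = uv$, $x_1^{-1}x_2 = y_1^p v$) and checks hypothesis (3) and conclusion (4) on them. The particular strings and the helper name `residual` are illustrative assumptions only; the residual $u^{-1}v$ is represented in reduced form by cancelling the longest common prefix.

```python
from os.path import commonprefix


def residual(u, v):
    """Reduced form of u^{-1} v: the pair of remainders after the common prefix."""
    k = len(commonprefix([u, v]))
    return u[k:], v[k:]


# Strings of the shape used in the proof: y1 = v u, y2 = u v, x1^{-1} x2 = y1^p v.
u, v, p = "b", "a", 1
y1, y2 = v + u, u + v
x1 = "c"
x2 = x1 + y1 * p + v

r, s = 1, 2
# Hypothesis (3): the residual is the same after r and after r + s repetitions.
lhs = residual(x1 + y1 * r, x2 + y2 * r)
rhs = residual(x1 + y1 * (r + s), x2 + y2 * (r + s))
hypothesis_holds = lhs == rhs

# Conclusion (4): a single repetition already leaves the residual unchanged.
conclusion_holds = residual(x1, x2) == residual(x1 + y1, x2 + y2)
```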

Lemma 4. Let $T' = (\Sigma, \Delta, Q', i', F', E', \lambda', \rho')$ be a p-subsequential
transducer equivalent to $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$. Let $q \in F$ be a
final state of $T$ and $q' \in F'$ a final state of $T'$ and assume that there exists
$x \in \Sigma^*$ such that $P(I, x, q) \neq \emptyset$ in $T$ and $P(i', x, q') \neq \emptyset$ in $T'$.
Then, there exists a finite set $Z \subset (\Delta^*)^{-1}\Delta^*$ such that for any paths
$\pi \in P(I, q)$ and $\pi' \in P(i', q')$, if $i[\pi] = i[\pi']$ then
$o[\pi]^{-1}o[\pi'] \in Z$.

Proof. Let $\pi$ and $\pi'$ be two paths satisfying the hypotheses of the lemma.
Since $T$ and $T'$ are equivalent, we have
$o[\pi]\rho(q) \subseteq [\![T]\!](i[\pi]) = [\![T']\!](i[\pi])$. Since $T'$ is
p-subsequential we also have $[\![T']\!](i[\pi]) = o[\pi']\rho'(q')$. Thus:

$o[\pi]\rho(q) \subseteq o[\pi']\rho'(q')$ (6)

Let $x \in \rho(q)$, there exists $y$ in $\rho'(q')$ such that $o[\pi]x = o[\pi']y$. Define
$Z$ as the finite set $Z = \rho(q)\rho'(q')^{-1}$. Then,
$o[\pi]^{-1}o[\pi'] = xy^{-1} \in Z$. This proves the lemma. □

Our main characterization result of this section is given by the following theorem 
which establishes the equivalence between three properties. 

Theorem 1. Let T be a trim finite-state transducer. Then the following three 
properties are equivalent: 

1. T is determinizable; 

2. T has the twins property; 

3. T is p-subsequentiable.

Fig. 3. Illustration of the definition of the paths used in the proof of Theorem 1.
Only the input labels of the paths are indicated. (a) Sibling states $q_1$ and $q_2$
in $T$, paths $\pi_1$, $c_1$, $\pi_2$, and $c_2$ defined as in the definition of the twins
property, and paths $\pi'_1$ and $\pi'_2$ from $q_1$ and $q_2$ to final states. (b) State $q$
in $T'$, path $\pi$, and paths $\mu_1$ and $\mu_2$ from $q$ to final states in $T'$.

Proof. $1 \Rightarrow 3$: By definition of the algorithm, the output of determinization
is a p-subsequential transducer.

$3 \Rightarrow 2$: Assume that $T$ is p-subsequentiable and let $T'$ be a p-subsequential
transducer equivalent to $T$. Let $q_1$ and $q_2$ be two siblings in $T$ and consider
four paths $\pi_1$, $c_1$, $\pi_2$, $c_2$ as in the definition of the twins property. Since
$T$ is trim, there exist a path $\pi'_1$ from $q_1$ to a final state and a path $\pi'_2$
from $q_2$ to a final state. Figure 3(a) illustrates the definition of these paths.
Note that $\pi'_1$ or $\pi'_2$ may be empty if $q_1$, resp. $q_2$, is a final state.

Since $T'$ is equivalent to $T$, there must be a path $\pi$ in $T'$ from the initial
state to a state $q$ with input label $xy^r$ with $r > 0$, a cycle $c$ at $q$ with
input label $y^s$ with $s > 0$, and two paths $\mu_1$ and $\mu_2$, potentially empty,
from $q$ to a final state, with input labels respectively $i[\pi'_1]$ and $i[\pi'_2]$.
Figure 3(b) illustrates the definition of these paths.

By definition of these paths, we have for any $t \geq 0$:

$i[\pi_1 c_1^{r+st} \pi'_1] = i[\pi c^t \mu_1]$ (7)

The conditions of Lemma 4 hold with the final states $n[\pi'_1]$ and $n[\mu_1]$ and
the paths $\pi_1 c_1^{r+st} \pi'_1$ and $\pi c^t \mu_1$. Thus, there exists a finite set $Z$
such that for any $t \geq 0$:

$o[\pi_1 c_1^{r+st} \pi'_1]^{-1} o[\pi c^t \mu_1] \in Z$ (8)

Since $Z$ is finite, there exist at least two distinct integers $t_0$ and $t_1$ such
that:

$o[\pi_1 c_1^{r+st_0} \pi'_1]^{-1} o[\pi c^{t_0} \mu_1] = o[\pi_1 c_1^{r+st_1} \pi'_1]^{-1} o[\pi c^{t_1} \mu_1]$ (9)

That is, writing $z$ for the common value of both sides of (9):

$o[\pi'_1]\, z\, o[\mu_1]^{-1} = o[\pi_1 c_1^{r+st_0}]^{-1} o[\pi c^{t_0}] = o[\pi_1 c_1^{r+st_1}]^{-1} o[\pi c^{t_1}]$ (10)

By Lemma 3, applied with $x_1 = o[\pi_1 c_1^r]$, $y_1 = o[c_1^s]$, $x_2 = o[\pi]$ and
$y_2 = o[c]$, this implies:

$o[\pi_1 c_1^r]^{-1} o[\pi] = o[\pi_1 c_1^{r+s}]^{-1} o[\pi c]$ (11)

We can prove in a similar way that:

$o[\pi_2 c_2^r]^{-1} o[\pi] = o[\pi_2 c_2^{r+s}]^{-1} o[\pi c]$ (12)

Thus:

$o[\pi_1 c_1^r]^{-1} o[\pi_2 c_2^r] = o[\pi_1 c_1^{r+s}]^{-1} o[\pi_2 c_2^{r+s}]$ (13)

And by Lemma 3:

$o[\pi_1]^{-1} o[\pi_2] = o[\pi_1 c_1]^{-1} o[\pi_2 c_2]$ (14)

Thus, any two siblings $q_1$ and $q_2$ in $T$ are twins, and $T$ has the twins property.

$2 \Rightarrow 1$: Assume that $T$ has the twins property. Let
$\{(q_1, z_1), \ldots, (q_n, z_n)\}$ be a subset created during the execution of the
determinization algorithm. By construction, states $q_1, \ldots, q_n$ can all be
reached from $I$ by paths labeled with the same input string $w$. Let $z$ be
defined by:

$z = \bigwedge_{\pi \in P(I, w, Q)} o[\pi]$ (15)

By definition of the algorithm, for $i = 1, \ldots, n$, there exists a path $\pi_i$
from $I$ to $q_i$ with input label $w$ such that:

$z_i = z^{-1}o[\pi_i]$ (16)

Let $\pi = \pi_i$ and $\pi' = \pi_j$ for some $i, j = 1, \ldots, n$. We have:

$z_j^{-1}z_i = o[\pi']^{-1}o[\pi]$ (17)

Thus, by Lemma 2, $z_j^{-1}z_i \in R$ with:

$R = \{o[\pi']^{-1}o[\pi] : i[\pi] = i[\pi'] = w, |w| \leq |Q|^2\}$ (18)

Define $K$ as the maximum length of the elements of $R$: $K = \max_{x \in R} |x|$.
Since the remainders in the same subset cannot have a common non-empty prefix,
for any remainder string $z_i$, there exists at least one remainder $z_j$ such that
$z_i \wedge z_j = \varepsilon$, thus $|z_j| + |z_i| = |z_j^{-1}z_i|$. Since $z_j^{-1}z_i \in R$,
we have $|z_j^{-1}z_i| \leq K$, and thus $|z_i| \leq K$. This inequality holds for
$i = 1, \ldots, n$, that is, any subset remainder belongs to $\Delta^{\leq K}$. Thus, a
subset necessarily belongs to $2^{Q \times \Delta^{\leq K}}$, which is a finite set. This
guarantees the termination of the algorithm and thus the determinizability of
$T$. □



5 Experiments and Results 

We have fully implemented the general determinization algorithm presented in
Section 3. We used a priority queue implemented with a heap to sort the
transitions leaving each subset and another priority queue to sort the final
output strings. Since the computation of the transitions leaving a subset only
depends on the states and remainder strings of that subset and on the input
transducer, one can limit the computation of the result to just the part that is
needed. Thus, we gave an on-the-fly implementation of the algorithm which was
incorporated in the FSM library [13].

^ Footnote to the proof of Theorem 1: note that we may have $q_i = q_j$ for some
choices of $i$ and $j$.

Our experiments in large-vocabulary speech recognition showed the algorithm to
be quite efficient. It took about 5 s on a 700 MHz Pentium III with 2048 KB of
cache and 4 GB of RAM to construct a p-subsequential transducer equivalent to a
transducer $T$ with 440,000 transitions representing the mapping from phonemic
sequences to word sequences obtained by composition of two transducers.

We also implemented an efficient algorithm for testing the twins property [1].
With our implementation, the p-subsequentiability of the transducer $T$ already
described could be tested in just 60 s using the same machine.



6 Conclusion 

A new characterization of p-subsequentiable transducers was given. The twins 
property was shown to be a necessary and sufficient condition for the p-subse- 
quentiability of a finite-state transducer without requiring it to be p-functional 
and a necessary and sufficient condition for the general determinizability of trans- 
ducers. We reported experimental results demonstrating the practicality of our 
algorithms for testing p-subsequentiability and for determinizing transducers in 
large-vocabulary speech recognition applications. 



References 

1. Cyril Allauzen and Mehryar Mohri. On the Determinizability of Weighted Au- 
tomata and Transducers. In Proceedings of the workshop Weighted Automata: 
Theory and Applications (WATA), Dresden, Germany, March 2002. 

2. Marie-Pierre Beal, Olivier Carton, Christophe Prieur, and Jacques Sakarovitch. 
Squaring transducers: An efficient procedure for deciding functionality and sequen- 
tiality. In Proceedings of LATIN’2000, volume 1776 of Lecture Notes in Computer 
Science. Springer, 2000. 

3. Jean Berstel. Transductions and Context-Free Languages. Teubner Studienbücher:
Stuttgart, 1979.

4. Christian Choffrut. Une caractérisation des fonctions séquentielles et des
fonctions sous-séquentielles en tant que relations rationnelles. Theoretical
Computer Science, 5:325-338, 1977.

5. Christian Choffrut. Contributions à l'étude de quelques familles remarquables
de fonctions rationnelles. PhD thesis (thèse de doctorat d'État), Université
Paris 7, LITP: Paris, France, 1978.

6. Karel Culik II and Jarkko Kari. Digital Images and Formal Languages. In Grzegorz 
Rozenberg and Arto Salomaa, editors. Handbook of Formal Languages, volume 3, 
pages 599-616. Springer, 1997. 






7. Maurice Gross and Dominique Perrin, editors. Electronic Dictionaries and
Automata in Computational Linguistics, volume 377 of Lecture Notes in Computer
Science. Springer-Verlag, 1989.

8. Ronald M. Kaplan and Martin Kay. Regular Models of Phonological Rule Systems. 
Computational Linguistics, 20(3):331-378, 1994. 

9. Lauri Karttunen. The Replace Operator. In 33rd Annual Meeting of the Associ- 
ation for Computational Linguistics, pages 16-23. Association for Computational 
Linguistics, 1995. Distributed by Morgan Kaufmann Publishers, San Francisco, 
California. 

10. Mehryar Mohri. On some Applications of Finite-State Automata Theory to Natural 
Language Processing. Journal of Natural Language Engineering, 2:1-20, 1996. 

11. Mehryar Mohri. Finite-State Transducers in Language and Speech Processing. 
Computational Linguistics, 23(2), 1997. 

12. Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Finite-State
Transducers in Speech Recognition. Computer Speech and Language, 16(1):69-88,
2002.

13. Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. General-Purpose
Finite-State Machine Software Tools. http://www.research.att.com/sw/tools/fsm,
AT&T Labs - Research, 1997.

14. Dominique Perrin. Words. In M. Lothaire, editor, Combinatorics on Words,
Cambridge Mathematical Library. Cambridge University Press, 1997.

15. Marcel Paul Schützenberger. Sur une variante des fonctions séquentielles.
Theoretical Computer Science, 4(1):47-57, 1977.

16. Andreas Weber and Reinhard Klemm. Economy of Description for Single-Valued
Transducers. Information and Computation, 118(2):327-340, 1995.




Bidirectional Push Down Automata 



Miguel A. Alonso^1, Víctor J. Díaz^2, and Manuel Vilares^{1,3}

^1 Departamento de Computación, Universidade da Coruña
Campus de Elviña s/n, 15071 La Coruña, Spain
{alonso,vilares}@udc.es
http://www.grupocole.org

^2 Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla
Avda. Reina Mercedes s/n, 41012 Sevilla (Spain)
vjdiaz@lsi.us.es

^3 Escuela Superior de Ingeniería Informática, Universidade de Vigo
Campus As Lagoas s/n, 32004 Orense, Spain
vilares@ei.uvigo.es



Abstract. We define a new model of automata for the description of
bidirectional parsing strategies for context-free grammars and a
tabulation mechanism that allows them to be executed in polynomial time.
This new model of automata provides a modular way of defining bidirectional
parsers, separating the description of a strategy from its execution.



1 Introduction 

The task of designing correct and efficient parsing algorithms can be simpli- 
fied by separating the definition of the parsing strategy from its tabular execu- 
tion. This can be accomplished through the use of automata: the actual parsing 
strategy can be described by means of the construction of a non-deterministic 
pushdown automaton, and tabulation is introduced by means of some generic 
mechanism such as memoization. The construction of parsers in this way allows 
more straightforward proofs of correctness and makes parsing strategies easier 
to understand and implement. 

This approach has been successfully applied to the design of parsing
algorithms for context-free grammars that read the input string left-to-right
[5,6]. In this article, we define new models of push-down automata which can
start reading the input string at any position, spanning to the left and to the
right to include substrings which were themselves read in the same bidirectional
way. Tabulation techniques are provided in order to execute these automata
efficiently.

This article is outlined as follows. Section 2 introduces push-down automata.
A bidirectional extension of push-down automata is presented in Section 3. Two
different tabular frameworks to execute bidirectional automata efficiently are
defined in Section 5. These frameworks are applied to predictive and bottom-up
head-corner parsing algorithms.



J.-M. Champarnaud and D. Maurel (Eds.): CIAA 2002, LNCS 2608, pp. 35-|4^ 2003. 
© Springer-Verlag Berlin Heidelberg 2003

2 Push-Down Automata

A Context-Free Grammar (CFG) is a tuple $(V_N, V_T, S, P)$, where $V_N$ is a finite
set of non-terminal symbols, $V_T$ a finite set of terminal symbols, $S \in V_N$ is
the axiom of the grammar, and $P$ is a finite set of productions (rewriting rules)
of the form $A \to \gamma$ with $A \in V_N$ and $\gamma \in (V_T \cup V_N)^*$. Push-Down
Automata (PDA) are the operational devices for parsing CFG. We define a PDA as
a tuple $(V_T, V_S, \Theta, \$_0, \$_f)$ where $V_T$ is a finite set of terminal symbols,
$V_S$ is a finite set of stack symbols, $\$_0 \in V_S$ is the initial stack symbol,
$\$_f \in V_S$ is the final stack symbol and $\Theta$ is a finite set of SWAP, PUSH and
POP transitions. A configuration of a PDA is usually defined as a pair
$(\xi, a_l \ldots a_n)$, where $\xi \in V_S^*$ is the stack attained and $a_l \ldots a_n$ the
part of the input string $a_1 \ldots a_n$ to be read. We consider an alternative and
equivalent definition of configuration in which the position $l$ is stored in the
top element of $\xi$. Thus, a configuration is given by the contents of $\xi$, a stack
of pairs in $V_S \times \mathbb{N}$. The initial configuration is $(\$_0, 0)$. Other
configurations are attained by applying transitions as follows:

- The application of a SWAP transition of the form $C \stackrel{a}{\longmapsto} F$ to a
  configuration $\xi(C, l)$ yields a configuration $\xi(F, l + |a|)$ as a result of
  replacing $C$ by $F$ and scanning the terminal $a = a_{l+1}$ or the empty string
  $a = \varepsilon$.

- The application of a PUSH transition $C \longmapsto C\,F$ to a configuration
  $\xi(C, l)$ yields a configuration $\xi(C, l)(F, l)$ as a result of pushing $F$
  onto $C$.

- The application of a POP transition of the form $C\,F \longmapsto G$ to a
  configuration $\xi(C, l)(F, m)$ yields a configuration $\xi(G, m)$ as a result of
  popping $C$ and $F$, which are replaced by $G$.

where $C, F, G \in V_S$ and $a \in V_T \cup \{\varepsilon\}$. An input string
$w = a_1 \ldots a_n$ is successfully recognized by a PDA if the final configuration
$(\$_0, 0)(\$_f, n)$ is attained.

Only SWAP transitions can scan elements from the input string. This is not
a limitation, as a scanning push transition $C \stackrel{a}{\longmapsto} C\,F$ could be
emulated by the consecutive application of two transitions $C \longmapsto C\,F'$ and
$F' \stackrel{a}{\longmapsto} F$, while a scanning pop transition
$C\,F \stackrel{a}{\longmapsto} G$ could be emulated by $C\,F \longmapsto G'$ and
$G' \stackrel{a}{\longmapsto} G$, where $F'$ and $G'$ are fresh stack symbols.
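
The three kinds of transition can be sketched as a single step function on configurations. This is a minimal sketch under assumptions that are not in the source: a configuration is a Python tuple of `(stack_symbol, position)` pairs with the top element last, transitions are tagged tuples, the empty string stands for $\varepsilon$, and the function name `pda_step` is hypothetical.

```python
def pda_step(stack, trans, word):
    """Apply one transition to a configuration, or return None if it does not apply.

    stack: tuple of (stack_symbol, position) pairs, top element last.
    trans: ("SWAP", C, a, F), ("PUSH", C, F) or ("POP", C, F, G),
           where a is a terminal or "" for the empty string.
    """
    kind = trans[0]
    top, l = stack[-1]
    if kind == "SWAP":
        _, c, a, f = trans
        # word[l:l + len(a)] is a_{l+1} for a terminal and "" for the empty string.
        if top == c and word[l:l + len(a)] == a:
            return stack[:-1] + ((f, l + len(a)),)
    elif kind == "PUSH":
        _, c, f = trans
        if top == c:
            return stack + ((f, l),)
    elif kind == "POP" and len(stack) >= 2:
        _, c, f, g = trans
        if stack[-2][0] == c and top == f:
            # The result keeps the position m of the popped top element.
            return stack[:-2] + ((g, l),)
    return None
```

Starting from `(("$0", 0),)` on the input `ab`, a PUSH followed by two SWAP transitions scanning `a` and `b` reaches `(("$0", 0), ("$f", 2))`, which has the shape of the final configuration for $n = 2$.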

We call transitions SWAP, PUSH and POP r-transitions as they can only
read the input string from left to right. Thus, push-down automata
can only be used to implement unidirectional parsing strategies that read the
input string in the same way. As an example of the kind of parsers that can
be implemented, a compilation schema^1 of a context-free grammar into a
push-down automaton implementing Earley's parsing strategy is derived. In
the resulting automaton, $V_T$ is the set of terminals of the source grammar;
$V_S$ is the union of $\{\$_0, \$_f\}$ and a set of dotted productions^2; the initial
element $\$_0$ is used to start computations; the final element $\$_f$ is
$(S \to \alpha\,\bullet)$; and $\Theta$ contains the set of transitions derived by the
compilation rules shown in Fig. 1. An [INIT] transition is in charge of starting
the parsing process. [PRED] transitions predict a non-terminal $B$ placed just
after the dot in a given production. Once a production having this non-terminal
in its left-hand side has been completely recognized, the dot is advanced by the
application of a [COMP] transition. Terminals are recognized by [SCAN]
transitions.

^1 A compilation schema is a set of rules indicating how to construct an automaton
according to a given grammar and parsing strategy.

^2 Dotted productions $A \to \alpha \bullet \beta$ are used to indicate that the part
$\alpha$ of the production has been recognized.

[INIT]  $\$_0 \longmapsto \$_0\ (S \to \bullet\,\alpha)$
[PRED]  $(A \to \alpha \bullet B\beta) \longmapsto (A \to \alpha \bullet B\beta)\ (B \to \bullet\,\gamma)$
[SCAN]  $(A \to \alpha \bullet a\beta) \stackrel{a}{\longmapsto} (A \to \alpha a \bullet \beta)$
[COMP]  $(A \to \alpha \bullet B\beta)\ (B \to \gamma\,\bullet) \longmapsto (A \to \alpha B \bullet \beta)$

Fig. 1. Compilation schema for Earley's strategy

$S^2$ items:
  SWAP  $\dfrac{[B, j; C, l]}{[B, j; F, l + |a|]}$   $C \stackrel{a}{\longmapsto} F$, $a = a_{l+1}$ or $a = \varepsilon$
  PUSH  $\dfrac{[B, j; C, l]}{[C, l; F, l]}$   $C \longmapsto C\,F$
  POP   $\dfrac{[C, l; F, m] \quad [B, j; C, l]}{[B, j; G, m]}$   $C\,F \longmapsto G$

$S^1$ items:
  SWAP  $\dfrac{[C, j, l]}{[F, j, l + |a|]}$   $C \stackrel{a}{\longmapsto} F$, $a = a_{l+1}$ or $a = \varepsilon$
  PUSH  $\dfrac{[C, j, l]}{[F, l, l]}$   $C \longmapsto C\,F$
  POP   $\dfrac{[F, l, m] \quad [C, j, l]}{[G, j, m]}$   $C\,F \longmapsto G$

Fig. 2. Inference rules for PDA with $S^2$ items (left-hand) and $S^1$ items (right-hand)

The direct execution of PDA may be exponential with respect to the length of
the input string and may even loop. To get polynomial complexity, we must avoid
duplicating computations by tabulating traces of configurations called items. The
amount of information to keep in an item is the crucial point in obtaining
efficient executions. From [5] we know that $S^2$ items, storing the two
elements placed on the top of the configuration stack, can be used to design
tabular interpretations which are sound and complete for any parsing strategy.

New items are derived from existing items by means of inference rules of
the form $\frac{\text{antecedents}}{\text{consequent}}$ conditions, similar to those used in
grammatical deduction systems, meaning that if all antecedents are present and
conditions are satisfied then the consequent item should be generated.
Conditions usually refer to transitions of the automaton and to terminals from
the input string. The set of inference rules for $S^2$ items is shown in Fig. 2.
Computations start with the item $[\square, 0; \$_0, 0]$ and finish with the item
$[\$_0, 0; \$_f, n]$.

[INIT]  $\dfrac{[\$_0, 0, 0]}{[S \to \bullet\,\alpha, 0, 0]}$

[PRED]  $\dfrac{[A \to \alpha \bullet B\beta, j, l]}{[B \to \bullet\,\gamma, l, l]}$

[SCAN]  $\dfrac{[A \to \alpha \bullet a\beta, j, l] \quad [a, l, l+1]}{[A \to \alpha a \bullet \beta, j, l+1]}$

[COMP]  $\dfrac{[B \to \gamma\,\bullet, l, m] \quad [A \to \alpha \bullet B\beta, j, l]}{[A \to \alpha B \bullet \beta, j, m]}$

Fig. 3. Deduction steps for Earley's algorithm
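
The deduction steps of Fig. 3 can be read as an agenda-driven recognizer. The sketch below is one possible reading, not the tabular interpreter of the paper: items $[A \to \alpha \bullet \beta, j, l]$ are encoded as tuples `(lhs, rhs, dot, j, l)`, the grammar is assumed to be a dictionary from non-terminals to lists of right-hand sides, terminals are assumed to be single characters that are not dictionary keys, and the name `earley_recognize` is hypothetical.

```python
def earley_recognize(grammar, start, word):
    """Recognize `word` by exhaustively applying the four deduction steps."""
    n = len(word)
    agenda = [(start, rhs, 0, 0, 0) for rhs in grammar[start]]  # [INIT]
    chart = set(agenda)

    def add(item):
        # Tabulation: every item is derived and processed at most once.
        if item not in chart:
            chart.add(item)
            agenda.append(item)

    while agenda:
        lhs, rhs, dot, j, l = agenda.pop()
        if dot < len(rhs):
            sym = rhs[dot]
            if sym in grammar:
                for gamma in grammar[sym]:                      # [PRED]
                    add((sym, gamma, 0, l, l))
                # [COMP] against items for sym already completed from position l.
                for (b, g, d, l2, m) in list(chart):
                    if b == sym and d == len(g) and l2 == l:
                        add((lhs, rhs, dot + 1, j, m))
            elif l < n and word[l] == sym:                      # [SCAN]
                add((lhs, rhs, dot + 1, j, l + 1))
        else:
            # [COMP]: this item is [B -> gamma., j, l]; advance items waiting for B at j.
            for (a, r, d, j2, l2) in list(chart):
                if d < len(r) and r[d] == lhs and l2 == j:
                    add((a, r, d + 1, j2, l))
    # Success corresponds to an item [S -> alpha., 0, n].
    return any((start, rhs, len(rhs), 0, n) in chart for rhs in grammar[start])
```

The completer is applied in both directions (when a completed item is processed and when a waiting item is processed) so that the result does not depend on the order in which items leave the agenda, which also covers empty productions.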



From [5] we also know that if the results of the non-deterministic
computations are constrained only by bottom-up propagation of computed facts
(e.g. bottom-up and Earley strategies, but not pure top-down strategies), $S^1$
items^3 storing only the top element of the configuration stack can be used to
derive a sound and complete tabular interpretation. The set of inference rules
for $S^1$ items is shown in Fig. 2. Computations start with the item
$[\$_0, 0, 0]$ and finish with $[\$_f, 0, n]$. Fig. 3 shows the set of deduction steps
corresponding to the parsing schema^4 Earley obtained by applying these inference
rules to an automaton resulting from the compilation schema shown in Fig. 1. The
worst-case time complexity for the $S^2$ and $S^1$ inference rules and for the
Earley schema is $O(n^3)$.

Although PDA can only be used to describe unidirectional strategies, their
tabulation technique can be extended to read the input string left-to-right,
right-to-left and bidirectionally. This kind of bidirectional tabulation makes
it possible to implement robust parsers by means of PDA, but it does not make it
possible to specify bidirectional parsing strategies, because PDA transitions do
not provide any way of controlling the direction of the parsing process.

^3 Although the items involved in these inference rules are usually called $S^1$
items, strictly speaking they hold slightly more than a true $S^1$ item, because
each item $[C, j, l]$ stores the top element $(C, l)$ of the configuration stack
plus the position $j$ of the second element $(B, j)$ [2].

^4 In brief, a parsing schema is a deductive parsing system where inference rules
are called deduction steps and conditions on the existence of a given terminal
$a_{l+1}$ are represented by means of special antecedent items of the form
$[a, l, l + 1]$ called hypotheses.






3 Bidirectional Push-Down Automata 

Bidirectional parsing strategies can start computations at any position of the
input string and can span to the right and to the left to include substrings
which were scanned in a bidirectional way by some subcomputations. As a first
step towards the definition of Bidirectional Push-Down Automata (BPDA),
we must adapt configurations in order to be able to represent the discontinuous
recognition of the input string. Thus, configurations of a BPDA will be given by
the contents of $\Xi$, a stack of triples in $V_S \times \mathbb{N} \times \mathbb{N}$. The initial
configuration is $(\$_0, 0, 0)$. Other configurations are attained by applying
transitions as follows:

- The application of a SWAP_R transition of the form $C \stackrel{a}{\longmapsto}_R F$ to a
  configuration $\Xi(C, k, l)$ yields a configuration $\Xi(F, k, l + |a|)$ as a
  result of replacing $C$ by $F$ and scanning the terminal $a = a_{l+1}$ or the
  empty string $a = \varepsilon$ to the right of the substring spanned by $C$.

- The application of a SWAP_L transition of the form $C \stackrel{a}{\longmapsto}_L F$ to a
  configuration $\Xi(C, k, l)$ yields a configuration $\Xi(F, k - |a|, l)$ as a
  result of replacing $C$ by $F$ and scanning the terminal $a = a_k$ or the empty
  string $a = \varepsilon$ to the left of the substring spanned by $C$.

- The application of a PUSH_R transition $C \longmapsto_R C\,F$ to a configuration
  $\Xi(C, k, l)$ yields a configuration $\Xi(C, k, l)(F, l, l)$. It is expected that
  $F$ will span a substring immediately to the right of the substring spanned
  by $C$.

- The application of a PUSH_L transition $C \longmapsto_L C\,F$ to a configuration
  $\Xi(C, k, l)$ yields a configuration $\Xi(C, k, l)(F, k, k)$. It is expected that
  $F$ will span a substring immediately to the left of the substring spanned
  by $C$.

- The application of a PUSH_U transition of the form
  $C \stackrel{a}{\longmapsto}_U C\,F$ to a configuration $\Xi(C, k, l)$ yields a
  configuration $\Xi(C, k, l)(F, m, m + |a|)$ as a result of pushing $F$ onto $C$
  and scanning the terminal $a = a_{m+1}$ or the empty string $a = \varepsilon$.
  PUSH_U transitions are undirected in the sense that $a$ is not necessarily
  adjacent to the substring spanned by $C$.

- The application of a POP_R transition of the form $C\,F \longmapsto_R G$ to a
  configuration $\Xi(C, k, l)(F, l, m)$ yields a configuration $\Xi(G, k, m)$. The
  substring spanned by $F$ is adjacent to the right of the substring spanned
  by $C$.

- The application of a POP_L transition of the form $C\,F \longmapsto_L G$ to a
  configuration $\Xi(C, k, l)(F, m, k)$ yields a configuration $\Xi(G, m, l)$. The
  substring spanned by $F$ is adjacent to the left of the substring spanned
  by $C$.

An input string $a_1 \ldots a_n$ is successfully recognized by a BPDA if the final
configuration $(\$_0, 0, 0)(\$_f, 0, n)$ is attained. SWAP_R, PUSH_R and POP_R
transitions are the r-transitions corresponding to unidirectional PDA. SWAP_L,
PUSH_L and POP_L transitions are l-transitions that advance "to the left" in the
reading of the input string. However, the union of r-transitions and
l-transitions is not sufficient to implement bidirectional parsers: we need
PUSH_U transitions of the form $C \stackrel{a}{\longmapsto}_U C\,F$ to start
subcomputations at any position of the input string. We guarantee that, in any
computation recognizing the input string, each terminal in the input string is
read only once by means of the definition of SWAP_R and SWAP_L transitions (they
cannot re-read elements which are in the span of the top element of the stack)
and the definition of POP_R and POP_L transitions (they cannot pop stack
elements spanning overlapping substrings). Pop transitions also ensure we read
the input string and not a permutation of it.
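
The seven kinds of BPDA transition can be sketched as a step function on stacks of triples. As before, this is only an illustrative sketch under assumptions that are not in the source: configurations are tuples of `(symbol, k, l)` triples with the top element last, transitions are tagged tuples, the empty string stands for $\varepsilon$, the start position `m` of a PUSH_U transition is passed explicitly because it is chosen non-deterministically, and the name `bpda_step` is hypothetical.

```python
def bpda_step(stack, trans, word, m=None):
    """Apply one BPDA transition, or return None if it does not apply.

    stack: tuple of (stack_symbol, k, l) triples, top element last.
    trans: ("SWAP_R", C, a, F), ("SWAP_L", C, a, F), ("PUSH_R", C, F),
           ("PUSH_L", C, F), ("PUSH_U", C, a, F), ("POP_R", C, F, G)
           or ("POP_L", C, F, G); a is a terminal or "".
    m:     start position chosen for a PUSH_U transition.
    """
    kind = trans[0]
    top, k, l = stack[-1]
    if kind == "SWAP_R":
        _, c, a, f = trans
        if top == c and word[l:l + len(a)] == a:
            return stack[:-1] + ((f, k, l + len(a)),)
    elif kind == "SWAP_L":
        _, c, a, f = trans
        if top == c and k - len(a) >= 0 and word[k - len(a):k] == a:
            return stack[:-1] + ((f, k - len(a), l),)
    elif kind == "PUSH_R":
        _, c, f = trans
        if top == c:
            return stack + ((f, l, l),)
    elif kind == "PUSH_L":
        _, c, f = trans
        if top == c:
            return stack + ((f, k, k),)
    elif kind == "PUSH_U":
        _, c, a, f = trans
        if top == c and m is not None and m >= 0 and word[m:m + len(a)] == a:
            return stack + ((f, m, m + len(a)),)
    elif kind in ("POP_R", "POP_L") and len(stack) >= 2:
        _, c, f, g = trans
        below, bk, bl = stack[-2]
        if below == c and top == f:
            if kind == "POP_R" and bl == k:   # F spans (l, m), right of C
                return stack[:-2] + ((g, bk, l),)
            if kind == "POP_L" and l == bk:   # F spans (m, k), left of C
                return stack[:-2] + ((g, k, bl),)
    return None
```

On the input `abc`, a PUSH_U transition scanning `b` at position 1, followed by a SWAP_L scanning `a` and a SWAP_R scanning `c`, takes `(("$0", 0, 0),)` to `(("$0", 0, 0), ("$f", 0, 3))`, so the string is read from the middle outwards.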

We define Bidirectional Push-Down Automata (BPDA) as a tuple
$(V_T, V_S, \Theta_B, \$_0, \$_f)$ with $\Theta_B$ containing SWAP_R, SWAP_L, PUSH_R,
PUSH_L, POP_R, POP_L and PUSH_U transitions. As an example of the kind of parsers
that can be implemented using BPDA, a compilation schema of a context-free
grammar into a bidirectional push-down automaton implementing a predictive
head-corner parsing strategy is derived. Head-corner parsing strategies can be
applied to context-free grammars in which each production has an element of
the right-hand side marked as the head of the production. For empty
productions $A \to \varepsilon$, the empty string $\varepsilon$ is considered the head of the
production. The head-corner relation $>_h$ on $V_N \times (V_N \cup V_T \cup \{\varepsilon\})$
is defined by $A >_h X$ if there is a production $A \to \alpha X \beta$ with $X$ the
head of the production. If $A \to \varepsilon$ then $A >_h \varepsilon$. The transitive and
reflexive closure of $>_h$ is denoted $>_h^*$. In the resulting automaton, $V_T$ is
the set of terminals of the source grammar; $V_S$ is the union of
$\{\$_0, \$_f\}$, the set of non-terminals of the source grammar and a set of dotted
productions $A \to \alpha \bullet \beta \bullet \gamma$ used to indicate that the part
$\beta$ of the production has been recognized; the initial element $\$_0$ is used to
start computations; the final element $\$_f$ is $(S \to \bullet\,\delta\,\bullet)$; and
$\Theta_B$ contains the set of transitions derived by the compilation rules shown in
Fig. 4. [LPRED] and [RPRED] transitions predict a non-terminal to the left and to
the right, respectively. Once a production having this non-terminal in its
left-hand side has been completely recognized, the dot is advanced by the
application of a pair of transitions [LCOMP1]-[LCOMP2] or [RCOMP1]-[RCOMP2]. The
head-corner of a given non-terminal is found by [HC_T] and [HC_ε] transitions.
[HC_N] transitions traverse backwards the chain of head-corners of a given
non-terminal. [LSCAN] and [RSCAN] transitions recognize terminals to the left and
to the right, respectively.

4 Context-Free Languages and BPDA 

BPDA accept exactly the class of context-free languages. Given a CFG
$\mathcal{G}$, the language accepted by the bidirectional push-down automaton built
following the compilation schema shown in Fig. 4 is the language generated by
$\mathcal{G}$. Therefore, the class of context-free languages is included in the
class of languages accepted by BPDA. Given a BPDA
$\mathcal{A} = (V_T, V_S, \Theta_B, \$_0, \$_f)$ we can construct a CFG
$G = (V_N, V_T, S, P)$ where $V_N \subseteq V_S \times V_S$, $S = (\$_0, \$_f)$ and the
productions in $P$ are obtained from transitions in $\Theta_B$ as follows:

- A production $(E, F) \to (E, C)\ a$ for each $C \stackrel{a}{\longmapsto}_R F \in \Theta_B$
  and $E \in V_S$.

- A production $(E, F) \to a\ (E, C)$ for each $C \stackrel{a}{\longmapsto}_L F \in \Theta_B$
  and $E \in V_S$.

- A production $(C, F) \to \varepsilon$ for each $C \longmapsto_R C\,F \in \Theta_B$.

- A production $(C, F) \to \varepsilon$ for each $C \longmapsto_L C\,F \in \Theta_B$.

- A production $(C, F) \to a$ for each $C \stackrel{a}{\longmapsto}_U C\,F \in \Theta_B$.

- A production $(E, G) \to (E, C)\ (C, F)$ for each $C\,F \longmapsto_R G \in \Theta_B$
  and $E \in V_S$.

- A production $(E, G) \to (C, F)\ (E, C)$ for each $C\,F \longmapsto_L G \in \Theta_B$
  and $E \in V_S$.

[LCOMP2]  $(B \to \alpha C \bullet \beta \bullet \gamma)\ (C \to \bullet\,\delta\,\bullet) \longmapsto_L (B \to \alpha \bullet C\beta \bullet \gamma)$
[RCOMP1]  $(C)\ (C \to \bullet\,\delta\,\bullet) \longmapsto_R (C \to \bullet\,\delta\,\bullet)$
[RCOMP2]  $(B \to \alpha \bullet \beta \bullet C\gamma)\ (C \to \bullet\,\delta\,\bullet) \longmapsto_R (B \to \alpha \bullet \beta C \bullet \gamma)$

Fig. 4. Compilation schema for a predictive head-corner strategy

Applying induction on the length of the derivations, we can show that
$(\$_0, \$_f) \Rightarrow^* a_1 \ldots a_n$ if and only if a computation of $\mathcal{A}$
starting at $(\$_0, 0, 0)$ attains a configuration $(\$_0, 0, 0)(\$_f, 0, n)$ reading
$a_1 \ldots a_n$ from the input string, i.e. $G$ generates exactly the language
accepted by $\mathcal{A}$. Therefore, the class of languages accepted by BPDA is
included in the class of context-free languages.
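
The construction of the productions from the transitions can be written down almost literally. The sketch below assumes the same tagged-tuple encoding of transitions as an illustration device (it is not part of the source), represents a non-terminal $(E, F)$ as a Python pair and a production as a pair `(lhs, rhs)` with `rhs` a tuple; the empty string stands for $\varepsilon$ and the name `bpda_to_cfg` is hypothetical.

```python
def bpda_to_cfg(stack_symbols, transitions):
    """Return the set of productions (lhs, rhs) obtained from BPDA transitions."""
    prods = set()
    for t in transitions:
        kind = t[0]
        if kind in ("SWAP_R", "SWAP_L"):
            _, c, a, f = t
            scanned = (a,) if a else ()
            for e in stack_symbols:
                # (E,F) -> (E,C) a  for SWAP_R,  (E,F) -> a (E,C)  for SWAP_L
                rhs = ((e, c),) + scanned if kind == "SWAP_R" else scanned + ((e, c),)
                prods.add(((e, f), rhs))
        elif kind in ("PUSH_R", "PUSH_L"):
            _, c, f = t
            prods.add(((c, f), ()))                      # (C,F) -> epsilon
        elif kind == "PUSH_U":
            _, c, a, f = t
            prods.add(((c, f), (a,) if a else ()))       # (C,F) -> a
        elif kind in ("POP_R", "POP_L"):
            _, c, f, g = t
            for e in stack_symbols:
                # (E,G) -> (E,C)(C,F)  for POP_R,  (E,G) -> (C,F)(E,C)  for POP_L
                rhs = ((e, c), (c, f)) if kind == "POP_R" else ((c, f), (e, c))
                prods.add(((e, g), rhs))
    return prods
```

For a three-transition automaton that reads `abc` from the middle outwards (PUSH_U scanning `b`, SWAP_L scanning `a`, SWAP_R scanning `c`), the resulting productions include $(\$_0, \$_f) \to (\$_0, Y)\ c$, $(\$_0, Y) \to a\ (\$_0, X)$ and $(\$_0, X) \to b$, which derive `abc` from the axiom.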

5 Tabulation of BPDA 

As in the case of PDA, the direct execution of BPDA may be exponential with 
respect to the length of the input string and may even loop. To solve this problem, 
in this section we extend the and tabulation techniques to the case of 
bidirectional push-down automata. 

5.1 The Framework 

From [5] we know that extensions of push-down automata can be tabulated by 
using items storing the two elements on the top of the configuration stack. 
In the case of BPDA, items are of the form [B, i, C, A:, ^], indicating the 
part of the input string Uk+i • ■ ■ a; recognized by the top element G and the 
part Oi+i . . . a j recognized by the element B placed immediately under G. The 
set of inference rules for items is shown in Fig. 0 Computations start with 
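$$\frac{[B, i, j; C, k, l]}{[B, i, j; F, k, l + |a|]} \quad C \stackrel{a}{\longmapsto}_R F,\ a = a_{l+1} \text{ or } a = \epsilon$$

$$\frac{[B, i, j; C, k, l]}{[B, i, j; F, k - |a|, l]} \quad C \stackrel{a}{\longmapsto}_L F,\ a = a_k \text{ or } a = \epsilon$$

$$\frac{[B, i, j; C, k, l]}{[C, k, l; F, l, l]} \quad C \longmapsto_R CF$$

$$\frac{[B, i, j; C, k, l]}{[C, k, l; F, k, k]} \quad C \longmapsto_L CF$$

$$\frac{[B, i, j; C, k, l]}{[C, k, l; F, m, m + |a|]} \quad C \stackrel{a}{\longmapsto}_U CF,\ a = a_{m+1} \text{ or } a = \epsilon$$

$$\frac{[C, k, l; F, l, m] \quad [B, i, j; C, k, l]}{[B, i, j; G, k, m]} \quad CF \longmapsto_R G$$

$$\frac{[C, k, l; F, m, k] \quad [B, i, j; C, k, l]}{[B, i, j; G, m, l]} \quad CF \longmapsto_L G$$

Fig. 5. Inference rules for $S^2$ items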



the item $[\square, 0, 0; \$_0, 0, 0]$ and finish with $[\$_0, 0, 0; \$_f, 0, n]$. The worst case time
complexity with respect to the length n of the input string is 0(n^). 

The application of this set of inference rules to an automaton resulting from
the compilation schema shown in Fig. 4 yields the set of deduction steps shown
in Fig. El (where ^ refers to a dotted production or $o) which is very close to 
the set of deduction steps corresponding to the predictive head-corner parsing 
schema pHC defined by Sikkel in 0 chapter 11], also working in 0{n^) time 
complexity. The main difference is that predictive steps of the schema pHC have
stronger constraints on the part of the input string to be considered
when seeking a head-corner. A minor difference is that the left-completer
and right-completer steps have been split into [LCOMP1]-[LCOMP2] and
[RCOMP1]-[RCOMP2] pairs.

5.2 The $S^1$ Framework

The complexity of the tabular framework can be reduced by considering more
compact kinds of item. From [5] we also know that a sound and complete tabular
interpretation for a given extension of push-down automata can be obtained
using $S^1$ items that store the top element of the configuration stack. In the case
of BPDA, $S^1$ items are of the form $[C, k, l]$, storing the top element $C$ with
the corresponding positions $k$ and $l$ of the substring spanned by it. The set of
inference rules for $S^1$ items is derived from the set of inference rules for $S^2$, as
is shown in Fig. 7. Computations start with the item $[\$_0, 0, 0]$ and finish with
$[\$_f, 0, n]$. The worst case time complexity with respect to the length $n$ of the
input string is O(n^). 
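
The single-element items lend themselves to a small agenda-driven closure. The following Python sketch is only an illustration under stated assumptions: the seven rules it applies are one reading of the inference rules of Fig. 7, the tuple encoding of transitions (`"R"`, `"L"`, `"pushR"`, `"pushL"`, `"pushU"`, `"popR"`, `"popL"`), the function name `tabulate` and the example automaton `HEAD_OUT` are hypothetical and do not come from the paper. The example automaton recognizes $a^n c b^n$ starting from the head $c$ and growing outwards in both directions.

```python
def tabulate(transitions, word, start="$0", final="$f"):
    """Close the set of items (symbol, lo, hi) under the assumed S^1 rules."""
    n = len(word)
    chart = {(start, 0, 0)}
    agenda = [(start, 0, 0)]

    def add(item):
        if item not in chart:
            chart.add(item)
            agenda.append(item)

    while agenda:
        sym, lo, hi = agenda.pop()
        for kind, *args in transitions:
            if kind in ("R", "L", "pushU"):
                src, a, dst = args
                if src != sym:
                    continue
                if kind == "R":          # [C,k,l] -> [F,k,l+|a|], a = a_{l+1}
                    if a == "":
                        add((dst, lo, hi))
                    elif hi < n and word[hi] == a:
                        add((dst, lo, hi + 1))
                elif kind == "L":        # [C,k,l] -> [F,k-|a|,l], a = a_k
                    if a == "":
                        add((dst, lo, hi))
                    elif lo > 0 and word[lo - 1] == a:
                        add((dst, lo - 1, hi))
                else:                    # [C,k,l] -> [F,m,m+|a|], any m
                    for m in range(n + 1):
                        if a == "":
                            add((dst, m, m))
                        elif m < n and word[m] == a:
                            add((dst, m, m + 1))
            elif kind in ("pushR", "pushL"):
                src, dst = args          # [C,k,l] -> [F,l,l] or [F,k,k]
                if src == sym:
                    pos = hi if kind == "pushR" else lo
                    add((dst, pos, pos))
            elif kind == "popR":         # [F,l,m], [C,k,l] -> [G,k,m]
                under, top, dst = args
                for other, lo2, hi2 in list(chart):
                    if sym == top and other == under and hi2 == lo:
                        add((dst, lo2, hi))
                    if sym == under and other == top and lo2 == hi:
                        add((dst, lo, hi2))
            elif kind == "popL":         # [F,m,k], [C,k,l] -> [G,m,l]
                under, top, dst = args
                for other, lo2, hi2 in list(chart):
                    if sym == top and other == under and lo2 == hi:
                        add((dst, lo, hi2))
                    if sym == under and other == top and hi2 == lo:
                        add((dst, lo2, hi))
    return (final, 0, n) in chart


# Hypothetical automaton: find the head c anywhere, then read a to the left
# and b to the right alternately; accept when the span covers the whole input.
HEAD_OUT = [
    ("pushU", "$0", "c", "H"),
    ("L", "H", "a", "HL"),
    ("R", "HL", "b", "H"),
    ("popR", "$0", "H", "$f"),
]

print(tabulate(HEAD_OUT, "aacbb"))
print(tabulate(HEAD_OUT, "acbb"))
```

Because every item is a triple of a stack symbol and two input positions, the chart is finite and the closure always terminates, even for automata whose direct execution would loop.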

As an example of the kind of strategies that can be implemented using the
$S^1$ framework, we show in Fig. 8 the compilation schema of a context-free
grammar into a bidirectional push-down automaton implementing a bottom-up
predictive head-corner parsing strategy. [HC$_T$] and [HC$_\epsilon$] transitions start the
bottom-up recognition of head-corners. [HC$_N$] transitions traverse backwards
[INIT]



[□,0,0;$o,0,0] 
[$o, 0,0; ^,0,0] 



[HC, 



[<P,s,t;A,l,r] 
[b,j - l,j] 



[A,l,r;B ^ a»b»j,j - l,j] 
[A,l,r;C ^ 



7T A B ~>h b 



[HCiv 

[HC 



B>hC 



[LPRED] 

[RPRED] 

[LSCAN] 



[A, Z, r; B ^ a • C • 7] 

[<P,s,t-,A,l,r] 
[AJ,r-,B ••JJ] 

[A,l,r;B aC • /3 • 



A>lB 



[B aC • P 

[A,l,r;B ^ a» P»C-f,i,j] 
[B ^ a» P»C-f,i,j;C,j,j] 

[A, l,r; B ^ aa • P • j, k] 
[a,j - 1, j] 



[RSCAN] 



[A, l,r; B ^ a • aP • J, j — I, k] 
[A,l,r;B a» P • a'y,i,j] 

[g, j, j + 1 ] 

[A, l,r; B a • Pa • + 1] 



[LCOMPl] 



[<P,j, k; C,j,j] 
[C,j,j;C ^ •StiJ] 
[<P,j,k;C^»5»i,j] 



[LCOMP2] 



[A,l,r; B aC • p • j, k] 

[B ^ aC • /3 • 7, j, k;C ^•6»i, j] 
[A, l,r; B ^ a • CP • 7, i, k] 



[RCOMPl] 



[Cdd',c ^ *5 • j,k] 
•(5*i, k] 



[RCOMP2] 



[A,/,r;B ^ a*/3*C7,t,j] 

[B ^ a • /3 • C7, i, j; C ^ •b • j, k] 
[A, l,r; B ^ a • pC • 7, i, k] 



Fig. 6. Deduction steps for a predictive head-corner parsing schema 
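$$\frac{[C, k, l]}{[F, k, l + |a|]} \quad C \stackrel{a}{\longmapsto}_R F,\ a = a_{l+1} \text{ or } a = \epsilon$$

$$\frac{[C, k, l]}{[F, k - |a|, l]} \quad C \stackrel{a}{\longmapsto}_L F,\ a = a_k \text{ or } a = \epsilon$$

$$\frac{[C, k, l]}{[F, l, l]} \quad C \longmapsto_R CF$$

$$\frac{[C, k, l]}{[F, k, k]} \quad C \longmapsto_L CF$$

$$\frac{[C, k, l]}{[F, m, m + |a|]} \quad C \stackrel{a}{\longmapsto}_U CF,\ a = a_{m+1} \text{ or } a = \epsilon$$

$$\frac{[F, l, m] \quad [C, k, l]}{[G, k, m]} \quad CF \longmapsto_R G$$

$$\frac{[F, m, k] \quad [C, k, l]}{[G, m, l]} \quad CF \longmapsto_L G$$

Fig. 7. Inference rules for $S^1$ items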



[INIT] $0 $0 (S^..q) 

[HCt] {A ^ 6i • S2 • 53) i-^u {A ^ 61 • S2 • 5 s) {B ^ a» a^-y) 
[HCjv] {A^»I3») ^R {B^a»A»j) 

[HC,] (A ^ 5i . 52 • 53 ) (A ^ 5i . 5a • 5s) {B ..) 

[LSCAN] {B aa • 13 • j) (B^a»af 3 »'y) 

[RSCAN] (B-^a»/3»a7) ^r [B ^ a» j3a»'y) 

[LCOMP] {B ^ aA» P»j) {A^ tS*) 1 — >l (B^a»Af3»'y) 
[RCOMP] {B ^ a* I3» A-f) {A^ tS*) 1 — >r {B ^ a» f3A»j) 

Fig. 8. Compilation schema for a bottom-up head-corner strategy 



the head-corner relation. Terminals are recognized to the left and to the right 
by [LSCAN] and [RSCAN] transitions, respectively. Once the right-hand side 
of a production has been completely recognized, [LCOMP] and [RCOMP] 
advance the dot of the production having the left-hand side of that rule to the 
left or to the right of the dot, respectively. 

When the $S^1$ inference rules are applied to an automaton resulting from
the compilation schema shown in Fig. 8, we obtain the set of deduction steps
corresponding to the parsing schema buHC defined by Sikkel in [^ chapter 11], 
as shown in Fig. E] considering that in the parsing schemata framework [Sj the 
antecedent $[A \to \delta_1 \bullet \delta_2 \bullet \delta_3, l, r]$ in steps [HC$_T$] and [HC$_\epsilon$] can be filtered out
as it does not restrict the application of these steps. 
[INIT]



[$o, 0 , 0 ] 

[S' • a, 0, 0] 



[HCt] 

[HCiv 

[HC,] 

[LSCAN] 

[RSCAN] 



[A ^ (5i • ^2 • S 3 , 1, r] 
[a,j - l,j] 



[B ^ a»a»^,j - l,j] 

[A 

[B ^ a» A»'y,i,j] 

[A ^ 5i • S 2 • (53, 

[B 

[B ^ aa» [3» 7 , j, fc] 

[aj - 1, j] 

[B ^ a • a/3» 7 , j - l,k\ 



[B ^ Q • /3» a7,i,i] 
[a, 3, 3 + 1] 

[B ^ a»/ 3 a» 7 ,i,i + 1] 



[LCOMP] 



[RCOMP] 



[A •S»,i,j] 

[B aA» f3 •'y,j,k] 
[B — > a • Ap • 7 , i, k] 



[A •S»,j,k] 

[B ^ a • P» A'y,i,j] 
[B ^ a • pA • 7 , i, k] 



Fig. 9. Deduction steps for a bottom-up head-corner parsing schema 



6 Conclusions 



In order to provide a common framework for the description of bidirectional 
parsing algorithms for context-free grammars, we have defined a new class of 
bidirectional push-down automata which works in polynomial time. We have 
also shown how tabular parsing algorithms can be derived from the automaton
describing the parsing strategy and the tabulation technique associated with
the automata model. As illustration, we have considered the case of the predic- 
tive head-corner and bottom-up head-corner strategies proposed in 0 but the 
approach can be applied to the other bidirectional strategies defined in the lit- 
erature. This approach can also be extended to automata models for extensions 
of context-free grammars. In this direction, we have investigated a bidirectional
version of Linear Indexed Automata for Tree Adjoining Grammars [1].

The use of bidirectional push-down automata to define parsers allowed us to
concentrate on the parsing strategy itself, abstracting from implementation
details such as the input positions spanned by a production or the information we
must track in items to guarantee the correctness of a parsing strategy.
Abstract. We consider non-self-embedding (NSE) context-free grammars
as a representation of regular sets. We point out its advantages with
respect to more classical representations by finite automata, in particular
when considering the efficient realization of the rational operations. We
give a characterization in terms of composition of regular grammars and
state relationships between NSE grammars and push-down automata.
Finally we show a polynomial algorithm to decide whether a context-free
grammar is self-embedding or not.



1 Introduction 

Regular languages play a central role in formal language theory, and indeed in 
theoretical computer science. Evidence is given by a wide and continued litera- 
ture, together with a variety of practical problems dealing with regular languages 
(e.g. compiling, text editing, DNA sequences, and so on). Different properties 
and questions have been investigated and many points of view have been
considered: logical, algebraic, analytical or algorithmic. The fundamental question
we deal with in this paper is the representation of a regular language, which is
itself defined as an abstract concept.

There are several ways of representing a regular language, each one with its 
peculiarity, advantages and disadvantages and this results in a richness of the 
theory. Every time we deal with a regular language, we can choose the most
convenient type of representation for it. In this context, all the procedures to
transform a representation into an equivalent one, together with their time and/or
space complexity, are of great interest. Regular languages are classically repre- 
sented by: regular expressions, finite automata (deterministic, non-deterministic, 
two-way, ...), logical formalisms and regular grammars. We observe that regu- 
lar grammars are only a different way to represent non-deterministic finite au- 
tomata. Therefore, if we look for a representation of regular languages which 

* This work was partially supported by MIUR project Linguaggi formali e automi: 
teoria e applicazioni. 

J.-M. Champarnaud and D. Maurel (Eds.): CIAA 2002, LNCS 2608, pp. 47-|5^ 2003. 

© Springer-Verlag Berlin Heidelberg 2003
is “better” than finite automata, we have to consider a larger class
of grammars. In this paper we focus on a particular family of grammars called 
non-self-embedding (NSE) grammars, strictly including the regular grammars, 
but still representing regular languages. 

A context-free grammar is self-embedding (SE) if there is a derivation for a
variable $A$ of type $A \Rightarrow^* \alpha A \beta$ with both $\alpha$, $\beta$ non-empty. A context-free
grammar is non-self-embedding (NSE) if it is not SE. From a result due to Chomsky
[2], we know that any NSE grammar generates a regular language. Notice that
the converse is always true: every regular language admits a NSE grammar
representing it (a right-linear grammar is NSE). Despite the scarce literature on
NSE grammars (we are not aware of any other result), we believe that NSE
grammars can be regarded as an interesting representation for regular
languages. Indeed, we exhibit a simple example showing that the representation of
a regular language can be much more concise (an exponential gap!) via NSE
grammars than via finite automata. This example is, actually, a simple case of
a more general situation where the representation by NSE grammars is always
more “compact”. The idea is to exploit the structure of a grammar that has
variables that can be used to generate different instances of the same language.

In this paper we first discuss the consequences of the SE (NSE) property 
on the grammar structure. We show that a NSE grammar can be expressed in 
terms of regular grammars. More precisely, we introduce an operation between
grammars, which we call $\oplus$-composition, that corresponds to the substitution
operation between languages. Then we characterize the NSE grammars as those
grammars that can be obtained by a finite number of $\oplus$-compositions of regular
grammars. As an immediate consequence, one obtains the Chomsky result, since regular
languages are closed under regular substitutions.

We also investigate the realization of rational operations on languages using 
NSE grammars and we highlight the advantages with respect to finite automata 
in many situations: for instance in the representation of the square of a lan- 
guage, or, more generally, in the representation of regular expressions containing 
different instances of the same language. Moreover, we remark that rational
operations have a very simple representation in terms of $\oplus$-composition of NSE
grammars and that NSE grammars are in general much more concise when we
exploit the operation of composition on grammars some of which are identical.

We then study what the SE property yields on push-down automata (PDA).
We show that a PDA obtained by the canonical construction from a NSE
grammar has stack size bounded by some constant. Recall, conversely, that
PDA with constant stack size recognize regular languages.

In the last part of the paper we give an algorithm that tests whether a
context-free grammar is NSE or not. This algorithm has a running time
polynomial in the size of the grammar. As a consequence, the SE property
is also related to decidability results on CFGs: indeed, it is well known
that it is undecidable whether a CFG generates a regular language or not.

For lack of space, most of the proofs are omitted from this extended abstract.
We refer the interested reader to the full paper [1].
2 Basic Notations and Definitions

We assume the reader is familiar with basic formal language theory, including
finite-state automata, push-down automata, regular expressions and grammars. We
will mainly use notations as in [4]. A context-free grammar (CFG) will be
denoted by $G = (V, T, P, S)$ where $V$, $T$, $P$, $S$ are the sets of variables, terminals,
productions and the start symbol, respectively. We always assume that $V \cap T = \emptyset$.
We will use informally the notion of size of a grammar as the space needed for
its description. By regular grammar we indicate a grammar that is either left-
or right-linear. We now recall the definition of self-embedding grammar.

Definition 1. A context-free grammar $G = (V, T, P, S)$ is self-embedding (SE)
if there exists a variable $A$ such that $A \Rightarrow^* \alpha A \beta$ with $\alpha, \beta \in (V \cup T)^+$.

In this paper we will be interested in non-self-embedding (NSE) context-free
grammars, i.e. grammars satisfying the property that, for all variables $A$, any derivation
$A \Rightarrow^* \alpha A \beta$ implies that either $\alpha = \epsilon$ or $\beta = \epsilon$. Notice that given a regular language
$L$ there exists a NSE grammar for $L$: in particular a right-linear grammar is NSE.
A well-known result states that the converse is also true. The following theorem
is due to Chomsky (see also m- 

Theorem 1. The language generated by a NSE grammar is regular. 

3 A Characterization for NSE Grammars 

In this section we revisit the Chomsky theorem in a more general framework. In
particular we give a Decomposition Theorem stating that the NSE grammars are
exactly the context-free grammars obtained as a particular composition of regular
grammars. We emphasize that this composition operation corresponds to the
operation of substitution on languages. As a consequence, NSE grammars are a
formalism that can be exponentially more compact than non-deterministic
finite automata.

Definition 2. Let $G_1 = (V_1, T_1, P_1, S_1)$ and $G_2 = (V_2, T_2, P_2, S_2)$ be two
context-free grammars, with $V_1 \cap V_2 = \emptyset$. The $\oplus$-composition of $G_1$ and $G_2$
is given by:

$G = G_1 \oplus G_2 = (V, T, P, S)$

where $V = V_1 \cup V_2$, $T = (T_1 \setminus V_2) \cup T_2$, $P = P_1 \cup P_2$ and $S = S_1$.

Notice that, if $T_1 \cap V_2 = \emptyset$ then $L(G_1 \oplus G_2) = L(G_1)$. Moreover, by definition,
if $G$ is a grammar and $G = G_1 \oplus G_2$, then the corresponding sets of variables, $V_1$
and $V_2$, are disjoint. This implies that a derivation of a word $w$ in $G$ can be split
in two phases. First, we apply only rules of $G_1$ (to variables in $V_1$) starting from
its start symbol and get a word $w'$ (on the alphabet $T_1$). Second, we apply only
rules of $G_2$ (starting from symbols in $T_1 \cap V_2$) and get the word $w$. More formally:
Remark 1. If $G = G_1 \oplus G_2$ then the language $L(G)$ can be obtained from $L(G_1)$ by
applying a substitution that maps each symbol $A \in T_1 \cap V_2$ to $L_A(G_2)$, where
$L_A(G_2)$ is the language generated by the grammar $G_2$ using $A$ as start symbol.
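
The composition of Definition 2 is purely set-theoretic, which a few lines of Python make concrete. This is a minimal sketch, assuming a hypothetical `Grammar` record and a hypothetical function `compose`; the two small grammars are illustrative and do not appear in the paper. The first is right-linear and treats `A` as a terminal, the second is left-linear and has `A` as its start symbol, so their composition generates $a^* b^+$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    variables: frozenset
    terminals: frozenset
    productions: frozenset  # pairs (left-hand side, tuple of right-hand side symbols)
    start: str


def compose(g1, g2):
    """Return g1 (+) g2: V = V1 u V2, T = (T1 \\ V2) u T2, P = P1 u P2, S = S1."""
    if g1.variables & g2.variables:
        raise ValueError("the variable sets must be disjoint")
    return Grammar(
        variables=g1.variables | g2.variables,
        terminals=(g1.terminals - g2.variables) | g2.terminals,
        productions=g1.productions | g2.productions,
        start=g1.start,
    )


# Right-linear grammar over the terminals {a, A}: S -> a S | A
G1 = Grammar(frozenset({"S"}), frozenset({"a", "A"}),
             frozenset({("S", ("a", "S")), ("S", ("A",))}), "S")
# Left-linear grammar over the terminal {b}: A -> A b | b
G2 = Grammar(frozenset({"A"}), frozenset({"b"}),
             frozenset({("A", ("A", "b")), ("A", ("b",))}), "A")

G = compose(G1, G2)
print(sorted(G.variables), sorted(G.terminals), G.start)
```

The symbol `A` disappears from the terminal alphabet of the result, which is exactly the substitution described in Remark 1: every `A` left by the first grammar is rewritten by the second one.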

Similarly, we consider sequences of $n$ applications of $\oplus$-composition, for some
$n$. It is not difficult to verify that the $\oplus$-composition is associative. Let
$G_1 \oplus G_2 \oplus \ldots \oplus G_n = G$ with $G_i = (V_i, T_i, P_i, S_i)$, for $i = 1, 2, \ldots, n$ and $G = (V, T, P, S)$.
By definition, we must have $V_i \cap V_{i+1} = \emptyset$ for $i = 1, \ldots, n - 1$. Moreover,
$V = V_1 \cup V_2 \cup \ldots \cup V_n$, $T = (T_1 \cup T_2 \cup \ldots \cup T_n) \setminus (V_2 \cup \ldots \cup V_n)$, $P = P_1 \cup P_2 \cup \ldots \cup P_n$
and $S = S_1$. The proof of the next lemma can be found in the full paper [1].

Lemma 1. Let $G_1$ and $G_2$ be two NSE grammars, then $G_1 \oplus G_2$ is NSE.

As a consequence of the above lemma, observe that the $\oplus$-composition of two
regular grammars $G_1$ and $G_2$ (either left- or right-linear) gives, in general, a
non-self-embedding grammar. The main result of this section states that the
converse is also true: every NSE grammar can be obtained as $\oplus$-composition of a
finite number of regular grammars. We now give the Decomposition Theorem.

Theorem 2. Let $G = (V, T, P, S)$ be a NSE grammar. Then, there exist $n$ regular
grammars $G_1, G_2, \ldots, G_n$, for some $n$, such that $G = G_1 \oplus G_2 \oplus \ldots \oplus G_n$.

Proof. (Sketch) The idea of the proof is to “extract” all grammars $G_1, G_2, \ldots, G_n$
from $G$, one after the other. We start with $G_1 = (V_1, T_1, P_1, S_1)$, which is obtained
from $G$ by considering some of the variables of $G$ as terminals. More precisely, $S_1 = S$,
while the variables in $V_1$ are only those that are both “reachable” from the
start symbol and from which the start symbol can be “reached”. All the other
variables of $G$ are added to the set of terminals together with the terminals of $G$.

Then we proceed by defining in order $G_2, \ldots, G_n$. In general, grammar $G_i =
(V_i, T_i, P_i, S_i)$, $i = 2, \ldots, n$, is defined after $G_{i-1}$ as follows. The start symbol $S_i$
is one of the variables of $G$ that is considered terminal in $G_{i-1}$. Then the variables in
$V_i$ are the variables of $G$ that are both “reachable” from $S_i$ and from which $S_i$ can be
“reached”. The productions in $P_i \subseteq P$ are defined by selecting the rules having
variables of $V_i$ in the left-hand side. All the symbols on the right-hand side that
are not in $V_i$ constitute the set of terminals $T_i$.

The key idea behind the proof lies in the order of the $G_i$'s (i.e. the order in
which the start vertices $S_i$ are chosen), which is based on a topological ordering of
the vertices of a particular production graph. This guarantees that, once a grammar
$G_i$ is chosen, the variables of $G$ occurring in the right-hand side of rules in $P_i$ do not
belong to the set $V_j$ for $j < i$. □

The complete proof can be found in the full paper [1]. We only remark that the
proof is constructive, i.e. it gives an effective and efficient (polynomial time)
procedure to decompose a NSE grammar in terms of regular grammars.

Observe that Theorem 2 together with Lemma 1 implies that NSE grammars
are exactly those grammars obtained as a $\oplus$-composition of regular grammars
(either right- or left-linear). Moreover, by Remark 1 one obtains the Chomsky result
(Theorem 1) as a corollary, since regular languages are closed under substitution.
To conclude, we remark that, although the decomposition of NSE grammars into
left- and right-regular grammars evokes a similarity with two-way finite automata,
the two representations are quite different (see Example 1 in the next section).

4 Advantages of NSE Representations 

In this section we compare the representation of regular languages by NSE gram- 
mars with respect to finite automata (regular grammars). We will see that NSE 
grammars are much more concise when we exploit the operation of composition 
on grammars some of which are identical. 

We start by defining the rational operations between languages represented 
via NSE grammars. Rational operations of union, concatenation and star of 
languages are basic in the construction of regular languages. 

Let $G_1 = (V_1, T_1, P_1, S_1)$ and $G_2 = (V_2, T_2, P_2, S_2)$ be two context-free
grammars. Without loss of generality, we assume that the sets of variables $V_1$ and $V_2$
are disjoint. We define the following grammars corresponding to the standard 
constructions for the union, concatenation and star of context-free grammars [^: 

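- $G_u = (V_u, T_u, P_u, S_u)$ with $V_u = V_1 \cup V_2 \cup \{S_u\}$, $T_u = T_1 \cup T_2$,
  $P_u = P_1 \cup P_2 \cup \{S_u \to S_1 \mid S_2\}$

- $G_c = (V_c, T_c, P_c, S_c)$ with $V_c = V_1 \cup V_2 \cup \{S_c\}$, $T_c = T_1 \cup T_2$,
  $P_c = P_1 \cup P_2 \cup \{S_c \to S_1 S_2\}$

- $G_s = (V_s, T_s, P_s, S_s)$ with $V_s = V_1 \cup \{S_s\}$, $T_s = T_1$, $P_s = P_1 \cup \{S_s \to S_s S_1 \mid \epsilon\}$.

Proposition 1. If the grammars $G_1$ and $G_2$ are NSE then the grammars $G_u$,
$G_c$ and $G_s$ are NSE.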

Observe that, when applying rational operations to NSE grammars, the
resulting grammar is of the “same” type as the starting ones (while in the automata
case we get non-determinism or $\epsilon$-transitions). Moreover the size of the
resulting grammar only increases by an additive constant: we add only one production
(while in the automata case, we have to add a number of transitions that depends
on the number of transitions in the starting automata).

The most interesting case is the concatenation when $L(G_1) = L(G_2)$: when
we define the grammar $G_c$ for the square $L' = LL$ of a regular language $L =
L(G)$, we do not need to make two disjoint copies of the grammar $G$. Then,
the size of the NSE grammar $G_c$ differs from the size of $G$ only by an additive
constant. On the other hand, classical constructions give a NFA for $L' = LL$
whose size is at least twice the size of a NFA for $L$. Observe that this is only one
simple case of a typical situation. In fact, by definition, any regular language
is obtained by applying rational operations to simpler languages (some of
which are often identical!). We observe that the rational operations can also be
given in terms of $\oplus$-compositions. More precisely, we define:

$G_{+} = (\{S\}, \{S_1, S_2\}, \{S \to S_1 \mid S_2\}, S)$

$G_{\cdot} = (\{S\}, \{S_1, S_2\}, \{S \to S_1 S_2\}, S)$

$G_{*} = (\{S\}, \{S_1\}, \{S \to S S_1 \mid \epsilon\}, S)$.

Then, it is immediate to verify that $G_u = G_{+} \oplus G_1 \oplus G_2$, $G_c = G_{\cdot} \oplus G_1 \oplus G_2$,
$G_s = G_{*} \oplus G_1$.

We observed that NSE representations are more concise than finite automata
when the language contains concatenations of copies of the same language.
Similar arguments apply to $\oplus$-compositions. We find that NSE grammars
are in general much more concise when we exploit the operation of composition
on grammars some of which are identical. The next example shows that the
difference in size between the two representations can even be exponential.

Example 1. Let $L = \{a^{2^k}\}$. Any NFA for $L$ has at least $2^k$ states, otherwise it
would have a loop on its states and would recognize an infinite language.

Let $G = (V, T, P, A_k)$, where $V = \{A_0, A_1, \ldots, A_k\}$, $T = \{a\}$, and

$P = \{A_k \to A_{k-1} A_{k-1},\ A_{k-1} \to A_{k-2} A_{k-2},\ \ldots,\ A_1 \to A_0 A_0,\ A_0 \to a\}$.

It is not difficult to see that $G$ is NSE and that $L(G) = \{a^{2^k}\}$. This shows
that the minimal size of a NFA accepting a regular language $L$ can be exponential
with respect to a NSE grammar generating $L$.
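
The gap of Example 1 can be checked mechanically. The sketch below is illustrative only: the helper names `doubling_grammar` and `generated_word` are hypothetical, and the grammar is encoded as a dictionary mapping each variable to its single right-hand side. It builds the $k + 1$ productions and expands the start symbol into the only word of the language.

```python
def doubling_grammar(k):
    """Productions A_0 -> a and A_i -> A_{i-1} A_{i-1} for i = 1..k."""
    productions = {"A0": ["a"]}
    for i in range(1, k + 1):
        productions[f"A{i}"] = [f"A{i - 1}", f"A{i - 1}"]
    return productions


def generated_word(productions, symbol):
    """Expand a symbol; each variable has exactly one production here."""
    if symbol not in productions:
        return symbol  # terminal
    return "".join(generated_word(productions, s) for s in productions[symbol])


k = 5
P = doubling_grammar(k)
word = generated_word(P, f"A{k}")
print(len(P), len(word))
```

With $k = 5$ the grammar has 6 productions while the generated word has length 32, so the number of productions grows linearly in $k$ and the word length (hence the number of NFA states) grows as $2^k$.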

Notice that $G$ is obtained as a composition of very simple grammars, exploiting
the effect of generating and substituting two copies of the same language in
different occurrences. More precisely, $G$ can be decomposed as $G = G_k \oplus G_{k-1} \oplus
\cdots \oplus G_1$, where (with $A_0$ replaced by $a$) $G_1 = (\{A_1\}, \{a\}, \{A_1 \to aa\}, A_1)$ and

$G_i = (\{A_i\}, \{A_{i-1}\}, \{A_i \to A_{i-1} A_{i-1}\}, A_i)$, $i = 2, \ldots, k$.

Remark that for every $i = 2, \ldots, k$, $L(G_i \oplus \cdots \oplus G_1)$ can be obtained by applying
a substitution that maps $A_{i-1}$ to $L(G_{i-1} \oplus \cdots \oplus G_1)$ (see Remark 1).

5 NSE Grammars and Push-Down Automata 

In this section we analyze the relationships between NSE grammars and push- 
down automata (PDA). The main result states that a context-free grammar is 
NSE if and only if the corresponding equivalent PDA has a bounded stack. 

We recall that a grammar $G = (V, T, P, S)$ is in canonical form if its
productions are of the kind $A \to a\gamma$ or $A \to \gamma$, with $A \in V$, $a \in T$, and $\gamma \in V^*$. As is well
known [4], for any context-free grammar $G$, there exists a context-free grammar
$G'$ in canonical form, such that $L(G') = L(G)$. Moreover, $G = (V, T, P, S)$ in
canonical form can be easily transformed into an equivalent PDA $M_G$ [4]. More
precisely, $M_G = (\{q\}, T, V, \delta, q, S, \emptyset)$, where the transition function $\delta$ is defined
as: $(q, \gamma) \in \delta(q, a, A)$ if and only if $A \to a\gamma$, with $a \in T \cup \{\epsilon\}$ and $A \in V$. The
PDA $M_G$ simulates the leftmost derivations of $G$:

$S \Rightarrow^*_{lm} w\gamma \iff (q, w, S) \vdash^*_{M_G} (q, \epsilon, \gamma)$, $\quad w \in T^*$, $\gamma \in V^*$. (1)

Proposition 2. Let $G$ be a NSE grammar in canonical form and let $M_G$ be the
corresponding equivalent PDA defined as above. Then there exists a constant
$K > 0$ such that in any computation of $M_G$ the string contained in the stack has
length upper-bounded by $K$.
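
The bound of Proposition 2 can be observed on a small instance. The following sketch is an assumption-laden illustration, not the construction of the paper: the function `run_pda`, the dictionary encoding of productions as pairs (input symbol or empty string, pushed variables) and the grammar `S -> a X S | b`, `X -> c` are all hypothetical. It explores every computation of the one-state PDA on a given word and records the largest stack encountered.

```python
from collections import deque


def run_pda(productions, start, word, max_configs=100_000):
    """Explore all computations of the one-state PDA; return (accepted, max stack)."""
    initial = (0, (start,))
    seen = {initial}
    queue = deque([initial])
    accepted = False
    max_stack = 1
    while queue:
        pos, stack = queue.popleft()
        max_stack = max(max_stack, len(stack))
        if not stack:
            # Acceptance by empty stack once the whole word has been read.
            accepted = accepted or pos == len(word)
            continue
        top, rest = stack[0], stack[1:]
        for symbol, gamma in productions.get(top, []):
            if symbol == "":
                nxt = (pos, tuple(gamma) + rest)
            elif pos < len(word) and word[pos] == symbol:
                nxt = (pos + 1, tuple(gamma) + rest)
            else:
                continue
            if nxt not in seen:
                if len(seen) >= max_configs:
                    raise RuntimeError("too many configurations explored")
                seen.add(nxt)
                queue.append(nxt)
    return accepted, max_stack


# Hypothetical NSE grammar in canonical form: S -> a X S | b, X -> c
CANONICAL = {"S": [("a", ("X", "S")), ("b", ())], "X": [("c", ())]}

print(run_pda(CANONICAL, "S", "acacb"))
print(run_pda(CANONICAL, "S", "ac"))
```

Here the longest right-hand side has 2 variables and there are 2 variables in total, so the stack never exceeds the bound $hk = 4$ used in the proof; on these inputs it never exceeds 2, however long the word is.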
Proof. Let $G = (V, T, P, S)$ be a NSE grammar in canonical form and let $k = |V|$
and $h = \max\{|\gamma| \;:\; A \to \gamma \text{ or } A \to a\gamma,\ A \in V,\ a \in T\}$. One can prove by
induction on $k$ that for any leftmost derivation $A \Rightarrow^*_{lm} w\gamma$ with $A \in V$, $w \in T^*$,
and $\gamma \in V^*$, one has $|\gamma| \le hk$. Let us set $K = hk$. By Eq. (1), if $\gamma$ appears in the
stack in some computation, then $S \Rightarrow^*_{lm} w\gamma$ for some $w \in T^*$ and then $|\gamma| \le K$. □

Since any PDA with bounded stack accepts a regular language,
Proposition 2 gives a new proof of Chomsky's Theorem (Theorem 1).

The converse of the previous proposition holds under the hypothesis that all
the symbols of the grammar $G$ are useful, i.e. they appear inside a derivation of
a string of $L(G)$. More precisely, let $G$ be a grammar in canonical form, whose
symbols are all useful, and let $M_G$ be the corresponding equivalent PDA defined
as above. If there exists a constant $K > 0$ such that in any computation of $M_G$
the string contained in the stack has length upper-bounded by $K$, then $G$ is
NSE. The proof goes by contradiction.

Furthermore it is well-known that given a PDA M, one can construct a 
context-free grammar Gm such that L{M) = L{Gm) fl]- We remark that, in 
this construction, the first step consists in building a PDA $M'$ having only one
state such that $L(M) = L(M')$. One can easily prove that
$M'$ has a bounded stack if and only if $M$ has a bounded stack. Moreover, from $M'$
one constructs the grammar $G_M$, which is in canonical form and is such that
the corresponding equivalent PDA is exactly $M'$. Therefore, using Proposition
2 and its converse one can state the following proposition.

Proposition 3. Let $M$ be a PDA and $G_M$ the corresponding equivalent
grammar. Then $M$ has bounded stack if and only if $G_M$ is NSE.

It is interesting to observe that the classical constructions on the
equivalence of PDA's and CFG's [3] are polynomial time. Therefore, for a CFG $G$ the
equivalent PDA $M_G$ has polynomial size w.r.t. $G$. Conversely, for a PDA $M$ the
equivalent CFG $G_M$ has polynomial size w.r.t. $M$. Therefore, the
representations of regular languages by PDA with bounded stack and by NSE grammars
are equivalent in size up to a polynomial.



6 Test for Self-Embedding Property 

We recall that, given a context-free grammar $G$ that generates a language over
the alphabet $\Sigma$, it is not decidable whether $L(G)$ is regular. (Notice that it is
undecidable even whether $L(G) = \Sigma^*$.) In this section we describe an algorithm
to test whether $G$ is non-self-embedding. If $G$ is non-self-embedding then $L(G)$
is regular. The test is based on the association to $G$ of a labelled directed graph,
whose labels are taken in a properly introduced semi-ring. The SE property for $G$
is characterized by the existence of some special paths in the graph. These paths
are detected by powering some associated matrix, with entries in the semi-ring.

Let us introduce the semi-ring $\mathcal{G} = \{\ell, r, b, 0\}$ equipped with operations sum
and product given by the following tables.
Let $G$ be a unit-free context-free grammar, where the set of variables is
$V = \{A_1, A_2, \ldots, A_n\}$. We associate a graph and a matrix with $G$ as follows.

The Labelled Production Graph $H(G)$ is a labelled directed graph whose
vertices are the variables of $G$; there is an edge $A \to B$ iff there exists a production
rule $A \to \alpha B \beta$ in $P$, and the label of edge $A \to B$ is defined in $\mathcal{G}$ by:

$$lab(A \to B) = \begin{cases} \ell & \text{if for every } A \to \alpha B \beta \text{ in } P \text{ it holds } \alpha \neq \epsilon,\ \beta = \epsilon \\ r & \text{if for every } A \to \alpha B \beta \text{ in } P \text{ it holds } \alpha = \epsilon,\ \beta \neq \epsilon \\ b & \text{otherwise} \end{cases}$$

The Transition Matrix $M(G)$ is a $n \times n$ matrix whose entries are defined in $\mathcal{G}$
as follows, for any $i, j \in \{1, 2, \ldots, n\}$:

$$M(G)_{ij} = \begin{cases} 0 & \text{if there is no production rule of type } A_i \to \alpha A_j \beta \text{ in } P \\ lab(A_i \to A_j) & \text{otherwise.} \end{cases}$$



The following algorithm tests whether the grammar $G$ is NSE.

Se-Test(G = (V, T, P, S))
 1  for i ← 1 to |V|
 2      do for j ← 1 to |V|
 3          do M(G)_ij ← 0
 4  for each production A_i → α A_j β in P
 5      do if α ≠ ε and β = ε
 6          then if M(G)_ij = ℓ or 0
 7              then M(G)_ij ← ℓ
 8              else M(G)_ij ← b
 9         if α = ε and β ≠ ε
10          then if M(G)_ij = r or 0
11              then M(G)_ij ← r
12              else M(G)_ij ← b
13         if α ≠ ε and β ≠ ε
14          then M(G)_ij ← b
15  M_1 ← M(G)
16  M ← M_1
17  for t ← 2 to |V|
18      do M_t ← M_{t-1} · M_1
19         M ← M + M_t
20  for i ← 1 to |V|
21      do if M_ii = b
22          then return "G is SE and A_i is SE"
23  return "G is NSE"
The algorithm Se-Test constructs the transition matrix $M(G)$ in lines 4-14.
After the execution of the for loop in lines 17-19, we find $M = M(G)^{+|V|}$, where
for a matrix $M$ and an integer $n$, we use the notation $M^{+n} = M + M^2 + \cdots + M^n$.

Finally Se-Test either returns the message “G is SE and $A_i$ is SE” when it finds
that $M(G)^{+|V|}_{ii} = b$ for some $i$, or the message “G is NSE” otherwise.

It is easy to see that the running time of the algorithm Se-Test on G = 
(V,T,P,S) is polynomial in the size of G (indeed o(|P| -I- 1^1"^)). 
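
A runnable counterpart of Se-Test helps to check small grammars by hand. The Python sketch below rests on explicit assumptions: the sum and product of the semi-ring are taken to be "0 is neutral for the sum and absorbing for the product, equal labels are preserved, different non-zero labels give b", which matches the definition of path types and reproduces the matrices of Example 2 but is not copied from the paper's tables. The names `s_add`, `s_mul`, `transition_matrix` and `se_test` are hypothetical, and unlike the pseudocode the function returns the list of all self-embedding variables instead of stopping at the first one.

```python
ZERO, LEFT, RIGHT, BOTH = "0", "l", "r", "b"


def s_add(x, y):
    """Assumed sum: 0 is neutral; two different non-zero labels give b."""
    if x == ZERO:
        return y
    if y == ZERO:
        return x
    return x if x == y else BOTH


def s_mul(x, y):
    """Assumed product: 0 is absorbing; a path keeps its type only if both parts agree."""
    if ZERO in (x, y):
        return ZERO
    return x if x == y else BOTH


def transition_matrix(variables, productions):
    """Build M(G); productions are pairs (lhs, tuple of rhs symbols)."""
    index = {v: i for i, v in enumerate(variables)}
    n = len(variables)
    matrix = [[ZERO] * n for _ in range(n)]
    for lhs, rhs in productions:
        for pos, symbol in enumerate(rhs):
            if symbol not in index:
                continue  # terminal symbol
            alpha, beta = rhs[:pos], rhs[pos + 1:]
            if not alpha and not beta:
                raise ValueError("unit production: the grammar must be unit-free")
            label = BOTH if alpha and beta else (LEFT if alpha else RIGHT)
            i, j = index[lhs], index[symbol]
            matrix[i][j] = s_add(matrix[i][j], label)
    return matrix


def mat_mul(x, y):
    n = len(x)
    result = [[ZERO] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                result[i][j] = s_add(result[i][j], s_mul(x[i][k], y[k][j]))
    return result


def mat_add(x, y):
    return [[s_add(p, q) for p, q in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


def se_test(variables, productions):
    """Return the self-embedding variables; an empty list means the grammar is NSE."""
    m1 = transition_matrix(variables, productions)
    power, total = m1, m1
    for _ in range(2, len(variables) + 1):
        power = mat_mul(power, m1)
        total = mat_add(total, power)
    return [v for i, v in enumerate(variables) if total[i][i] == BOTH]


# Grammar of Example 2: S -> aSb | AB, A -> aB | a, B -> bA | Bb | b
EXAMPLE_2 = [
    ("S", ("a", "S", "b")), ("S", ("A", "B")),
    ("A", ("a", "B")), ("A", ("a",)),
    ("B", ("b", "A")), ("B", ("B", "b")), ("B", ("b",)),
]
print(se_test(["S", "A", "B"], EXAMPLE_2))
print(se_test(["S"], [("S", ("a", "S")), ("S", ("b",))]))
```

The first call reports all three variables of Example 2 as self-embedding, and the second reports none for a right-linear grammar. The three nested loops of `mat_mul` are executed $|V| - 1$ times, which is where the polynomial dependence on $|V|$ comes from.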

We now want to prove the correctness of the algorithm Se-Test. More
precisely, we claim that the algorithm Se-Test on a unit-free grammar $G$ returns
the message “G is NSE” iff $G$ is a non-self-embedding grammar.

A grammar $G$ is self-embedding by definition iff there exists a variable $A$ and
a derivation $A \Rightarrow^* \alpha A \beta$ with $\alpha, \beta \neq \epsilon$. Note that nothing constrains such
a derivation, in particular its length. The following proposition restricts the
type and the length of derivations to be tested in order to decide whether a
grammar is self-embedding. We say that a derivation $\alpha_0 A_0 \beta_0 \Rightarrow \alpha_1 A_1 \beta_1 \Rightarrow \cdots \Rightarrow \alpha_n A_n \beta_n$
is simple if $A_h = A_k$ with $1 \le h < k \le n$ implies $h = 1$, $k = n$. Therefore the
length of a simple derivation in grammar $G = (V, T, P, S)$ is at most $|V|$.

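Proposition 4. Let $G = (V, T, P, S)$ be a unit-free context-free grammar. A
variable $A \in V$ is self-embedding iff there exists a derivation $A \Rightarrow^* \alpha A \beta$ such
that one of the following two cases holds:

1. the derivation is simple and $\alpha, \beta \neq \epsilon$

2. the derivation $A \Rightarrow^* \alpha A \beta$ can be split into

   $A \Rightarrow^* \alpha' B \beta' \Rightarrow^* \alpha' \alpha'' B \beta'' \beta' \Rightarrow^* \alpha' \alpha'' \alpha''' A \beta''' \beta'' \beta'$

   so that the derivations $A \Rightarrow^* \alpha' B \beta' \Rightarrow^* \alpha' \alpha''' A \beta''' \beta'$ and $B \Rightarrow^* \alpha'' B \beta''$ are
   both simple and either $\alpha' \alpha''' = \beta'' = \epsilon$ and $\alpha'', \beta''' \beta' \neq \epsilon$,
   or $\alpha'' = \beta''' \beta' = \epsilon$ and $\alpha' \alpha''', \beta'' \neq \epsilon$.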

Proof. (Sketch) Let $A$ be a self-embedding variable of $G$ and $A \Rightarrow^* \alpha A \beta$ be a
derivation with $\alpha, \beta \neq \epsilon$ of minimal length $n$. If the derivation is not simple
then it can be split according to the statement. One can prove that the resulting
derivations are both simple by supposing the contrary, removing the derivation
from a repeated variable to its next occurrence and showing that we then obtain
some new derivation that contradicts the minimality of $n$. □

Proposition 4 can be translated in terms of the labelled production graph.
Let us say that a path in the labelled production graph $H(G)$ is of type $\ell$ ($r$, resp.)
if all the edges composing it are labelled $\ell$ ($r$, resp.); it is of type $b$ otherwise.
Proposition 4 implies that $G$ is self-embedding iff there exists a vertex $X$ in
$H(G)$ and either $X$ has a loop of type $b$ and length at most $|V|$ or $X$ has two
loops of length at most $|V|$, one of type $\ell$ and the other one of type $r$.

Furthermore, this characterization can be easily tested on the transition
matrices. Indeed $G$ is SE iff there exists a variable $A_i$ such that $M(G)^{+|V|}_{ii} = b$.
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Example 2. Let G = {V,T, P, S) where V = {S,A,B}, T = {a,b} and P is 
given by: S^aSb/AB, A-^aB/a and B^bA/ Bb/b. Suppose vertices in V are 
numbered as Ai = S, A 2 = A and A^ = B. 



The matrix M(G) is the following one:

$$M(G) = \begin{pmatrix} b & r & l \\ 0 & 0 & l \\ 0 & l & r \end{pmatrix}$$



The grammar G is SE and all its variables are self-embedding. As an example,
A is SE and the derivation A ⇒ aB ⇒ aBb ⇒ abAb satisfies case 2) of Proposition 4.
Further, this derivation yields in H(G) two loops on B, each of length at most |V|:
B → B of type r and length 1 < |V|, and B → A → B of type l and length
2 < |V|. Finally, these loops on $B = A_3$ imply $M(G)^{|V|}_{33} = b$. Indeed $M(G)^{|V|}_{ii} = b$
for i = 1, 2, 3. Applying the algorithm Se-Test on G we have:



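$$M_1 = M(G); \quad M_2 = \begin{pmatrix} b & b & b \\ 0 & l & b \\ 0 & b & b \end{pmatrix}; \quad M_3 = \begin{pmatrix} b & b & b \\ 0 & b & b \\ 0 & b & b \end{pmatrix}$$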



After the execution of the for loop of lines 17-19, $M = M_3 = M(G)^3$. Hence
the algorithm finds $M_{1,1} = b$ in line 21 and returns: "G is SE and $A_1$ is SE".
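
The matrix test of Example 2 can be sketched in a few lines of Python. The listing of Se-Test is not reproduced here, so the two label operations below (`add` for alternative paths, `mul` for consecutive paths) are an assumed reading of the characterization above, and the function accumulates the diagonals of the powers $M^1, \ldots, M^{|V|}$ so that loops of every length up to |V| are taken into account. All function names are hypothetical.

```python
# Labels: "0" (no path), "l", "r", "b".
# ASSUMPTION: add/mul are our reading of the loop characterization,
# not the paper's own definition of the matrix operations.

def add(x, y):
    """Combine the labels of two alternative paths."""
    if x == "0":
        return y
    if y == "0" or x == y:
        return x
    return "b"  # l together with r, or anything together with b


def mul(x, y):
    """Label of a path followed by another path."""
    if x == "0" or y == "0":
        return "0"
    return x if x == y and x in ("l", "r") else "b"


def mat_mul(a, b):
    """Matrix product over the label operations above."""
    n = len(a)
    out = [["0"] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = "0"
            for k in range(n):
                acc = add(acc, mul(a[i][k], b[k][j]))
            out[i][j] = acc
    return out


def self_embedding_variables(m):
    """Indices i whose loops of length 1..|V| combine to label 'b'."""
    n = len(m)
    diag = ["0"] * n
    power = m
    for step in range(n):
        if step > 0:
            power = mat_mul(power, m)
        diag = [add(diag[i], power[i][i]) for i in range(n)]
    return [i for i in range(n) if diag[i] == "b"]


# Matrix M(G) of Example 2, rows and columns ordered S, A, B.
M = [["b", "r", "l"],
     ["0", "0", "l"],
     ["0", "l", "r"]]

print(self_embedding_variables(M))
```

Under these assumptions every variable of Example 2 is reported as self-embedding, in agreement with the text.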



The previous considerations allow us to claim the correctness of algorithm 
Se-Test and state the main result in this section. 



Theorem 3. It is decidable whether a context-free grammar is NSE or not. 



7 Conclusions and Further Directions 

The next step of our work will be the comparison of the representation of regular
languages by NSE grammars with all the other known formalisms. In particular,
it would be very interesting to develop efficient algorithms to transform regular
expressions to/from NSE grammars as well as finite automata to/from NSE
grammars. It could also be interesting to study the complexity of an algorithm that
transforms a finite automaton into a regular expression using an NSE grammar
as an intermediate stage of the transformation.



References 

1. M. Anselmo, D. Giammarresi, S. Varicchio. Non-Self-Embedding Grammars as
representation for Regular Languages. Full paper available at
www.mat.uniroma2.it/~giammarr/Papers/nse.ps

2. N. Chomsky. A note on phrase-structure grammars. Information and Control. Vol
2, pp. 393-395, 1959b.

3. M. A. Harrison. Introduction to Formal Language Theory. Addison-Wesley, Reading,
MA, 1978.

4. J. E. Hopcroft, R. Motwani and J. D. Ullman. Introduction to Automata Theory,
Languages and Computation - 2nd Edition. Addison-Wesley, 2001.



Simulation of Gate Circuits in the Algebra of Transients 



Janusz Brzozowski and Mihaela Gheorghiu 



School of Computer Science, 
University of Waterloo, 
Waterloo, ON, Canada N2L 3G1 
{brzozo,mgheorgh}@uwaterloo.ca



Abstract. We study simulation of gate circuits in algebra C recently introduced
by Brzozowski and Ésik. A transient is a word consisting of alternating 0s and 1s;
it represents a changing signal. In C, gates process transients instead of 0s and 1s.
Simulation in C is capable of counting signal changes, and detecting hazards. We
study two simulation algorithms: a general one, A, that works with any state, and
Ã, that applies if the initial state is stable. We show that the two algorithms agree
in the stable case. We prove the sufficiency of the simulation: all signal changes
occurring in binary analysis are also predicted by Algorithm A.



1 Introduction 

Asynchronous circuits, in contrast to synchronous ones, operate without a clock. Interest 
in asynchronous circuits has grown in recent years [4,6,9], because they offer the potential
for higher speed and lower energy consumption, avoid clock distribution problems, 
handle metastability safely, and are amenable to modular design. 

Despite its advantages, asynchronous design has some problems, among them, haz- 
ards. A hazard is an unwanted signal change, caused by stray delays. A hazard may affect 
the correctness of a computation. Because hazards are important, much research has been 
done on their detection. Multiple-valued algebras play an important role here [2].
Recently, Brzozowski and Ésik introduced an infinite-valued algebra C, which subsumes
all the previously used algebras [1], and a polynomial-time simulation algorithm based
on C. The algorithm is capable not only of detecting hazards, but also of counting the 
number of signal changes in the worst case; this provides an estimate of the energy 
consumption. 

The purpose of this paper is to compare the Brzozowski-Ésik simulation of a circuit
with the binary analysis of the circuit. We prove that all the changes that occur in the
binary analysis are also predicted by simulation.



2 The Network Model 

The material here is based on [1]. For an integer n > 0, [n] denotes {1, ..., n}. Boolean
operations AND, OR, and NOT are denoted $\wedge$, $\vee$, and an overbar, respectively. Given a gate
circuit with n inputs and m gates, we associate an input variable $X_i$ with each input,
$i \in [n]$, and a state variable $s_j$ with the output of each gate, $j \in [m]$. Input and state
variables take values in the binary domain $\mathcal{D} = \{0, 1\}$. Each state variable $s_i$ has an
excitation $S_i$, which is the Boolean function of the corresponding gate.

Fig. 2. Network graph for circuit of Fig. 1

Definition 1. A network is a tuple $N = (\mathcal{D}, \mathcal{X}, \mathcal{S}, \mathcal{E})$, where $\mathcal{D}$ is the domain of values,
$\mathcal{X} = \{X_1, \ldots, X_n\}$ the set of inputs, $\mathcal{S} = \{s_1, \ldots, s_m\}$ the set of state variables
with associated excitations $S_1, \ldots, S_m$, and $\mathcal{E} \subseteq (\mathcal{X} \times \mathcal{S}) \cup (\mathcal{S} \times \mathcal{S})$ a set of directed
edges. There is an edge between x and y if and only if the excitation of y depends on x.
The network graph is the digraph $(\mathcal{X} \cup \mathcal{S}, \mathcal{E})$. Note that $\mathcal{D}$ need not be {0, 1}.



Example 1. The circuit of Fig. 1 has input $X_1$, state variables $s_1, s_2, s_3$, and excitations
$S_1 = \overline{X_1}$, $S_2 = s_1 \wedge s_3$, $S_3 = \overline{s_2}$ in domain $\mathcal{D} = \{0, 1\}$. Its network graph is
shown in Fig. 2.

A state of N is an m-tuple b of values from $\mathcal{D}$ assigned to state variables $s_1, \ldots, s_m$.
A total state is an (n + m)-tuple $c = a \cdot b$ of values from $\mathcal{D}$, the n-tuple a being the values
of the inputs, and the m-tuple b, the values of state variables. The dot "$\cdot$" separates
inputs from state variables.

Each excitation $S_i$ is a function of some inputs $X_{j_1}, \ldots, X_{j_l} \in \mathcal{X}$, and some state
variables $s_{i_1}, \ldots, s_{i_k} \in \mathcal{S}$, i.e., $S_i = f(X_{j_1}, \ldots, X_{j_l}, s_{i_1}, \ldots, s_{i_k})$, where $f : \mathcal{D}^{l+k} \to \mathcal{D}$.
We also treat $S_i$ as a function from $\mathcal{D}^{n+m}$ into $\mathcal{D}$. Thus, let $S'_i : \mathcal{D}^{n+m} \to \mathcal{D}$ be
$S'_i(a \cdot b) = f(a_{j_1}, \ldots, a_{j_l}, b_{i_1}, \ldots, b_{i_k})$, for any $a \cdot b$. From now on we write $S_i$ for $S'_i$;
the meaning is clear from the context.

For any $i \in [m]$, the value of $S_i$ in total state $a \cdot b$ is denoted $S_i(a \cdot b)$. The tuple
$S_1(a \cdot b), \ldots, S_m(a \cdot b)$ is denoted by $S(a \cdot b)$. For any $a \cdot b$, we define the set of unstable
state variables as $U(a \cdot b) = \{s_i \mid b_i \neq S_i(a \cdot b)\}$. Thus, $a \cdot b$ is stable if and only if
$U(a \cdot b) = \emptyset$, i.e., $S(a \cdot b) = b$.

3 Binary Analysis of Networks 

In response to changes of its inputs, a circuit passes through a sequence of states as its 
internal signals change. By analyzing a circuit we mean exploring all possible sequences 
of states. This section describes a formal analysis model introduced by Muller [10], and
later called the General Multiple Winner (GMW) model. Our presentation follows that
of [3], but here we refer to the GMW model as binary analysis.
In this section we use the binary domain, $\mathcal{D} = \{0, 1\}$. We describe the behavior of
a network started in a given state with the input kept constant at value $a \in \{0,1\}^n$, by
defining a binary relation $R_a$ on the set $\{0, 1\}^m$ of states of N. For any $b \in \{0, 1\}^m$,
$b R_a b$, if $U(a \cdot b) = \emptyset$, i.e., total state $a \cdot b$ is stable, and $b R_a b^K$, if $U(a \cdot b) \neq \emptyset$, and
K is any nonempty subset of $U(a \cdot b)$, where by $b^K$ we mean b with all the variables in
K complemented. No other pairs of states are related by $R_a$. As usual, we associate a
digraph with the $R_a$ relation, and denote it $G_a$. In examples, we represent tuples without
commas or parentheses, for convenience. Thus (0, 0, 0) is written as 000, etc.

For given $a \in \{0, 1\}^n$ and $b \in \{0, 1\}^m$ we define the set of all states reachable from
b in relation $R_a$ as $reach(R_a(b)) = \{c \mid b R_a^* c\}$, where $R_a^*$ is the reflexive and transitive
closure of $R_a$. We denote by $G_a(b)$ the subgraph of $G_a$ corresponding to $reach(R_a(b))$.

Example 2. For the circuit in Fig. 1, graph $G_0(000)$ is shown in Fig. 3(a), where unstable
variables are underlined. Note that the graph contains no stable states. Graph $G_1(111)$
is shown in Fig. 3(b). Here there is one stable state. To illustrate hazardous behavior,
consider path $\pi_1$ = 111, 011, 001. Here $s_2$ changes once from 1 to 0, and $s_3$ does not
change. However, along path $\pi_2$ = 111, 110, 100, 101, 011, 001, $s_2$ changes from 1 to
0 to 1 to 0, and $s_3$ changes from 1 to 0 to 1. If the behavior of $\pi_1$ is the intended one,
then $\pi_2$ violates it. Along $\pi_2$ there are unwanted signal pulses: a 1-pulse in $s_2$, and a
0-pulse in $s_3$. The first pulse is an example of a dynamic hazard, and the second, of a
static hazard. These pulses can introduce errors in the circuit operation.
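
Binary analysis of this small circuit can be replayed with a short Python sketch. It assumes the excitations of the example circuit are $S_1 = \overline{X_1}$, $S_2 = s_1 \wedge s_3$, $S_3 = \overline{s_2}$ (the reading that is consistent with the paths $\pi_1$ and $\pi_2$ above); the function names are hypothetical and the code is only an illustration of the relation $R_a$, not an efficient analysis tool.

```python
from itertools import combinations


def excitations(a, b):
    """Assumed example circuit: S1 = NOT X1, S2 = s1 AND s3, S3 = NOT s2."""
    (x1,), (s1, s2, s3) = a, b
    return (1 - x1, s1 & s3, 1 - s2)


def unstable(a, b):
    """U(a.b): indices of state variables that differ from their excitation."""
    exc = excitations(a, b)
    return [i for i in range(len(b)) if b[i] != exc[i]]


def successors(a, b):
    """All states c with b R_a c."""
    u = unstable(a, b)
    if not u:
        return {b}  # stable state: b R_a b
    result = set()
    for size in range(1, len(u) + 1):
        for subset in combinations(u, size):
            # complement every variable in the chosen nonempty subset K
            result.add(tuple(1 - v if i in subset else v
                             for i, v in enumerate(b)))
    return result


def reach(a, b):
    """States reachable from b under the reflexive-transitive closure of R_a."""
    seen, stack = {b}, [b]
    while stack:
        for nxt in successors(a, stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


stable_from_111 = [s for s in reach((1,), (1, 1, 1)) if not unstable((1,), s)]
print(stable_from_111)
```

With input 1 and start state 111 the only stable reachable state is 001, while with input 0 and start state 000 no reachable state is stable, as stated in Example 2.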

4 Transients 

While binary analysis is an exhaustive analysis of a circuit, it is inefficient, since the state 
space is exponential. Simulation using a multi-valued domain is an efficient alternative, 
if not all the information from binary analysis is needed. 

The material here is based on [1]. A transient is a nonempty word over {0, 1} in
which no two consecutive symbols are the same. Thus the set of all transients is

T = 0(10)* ∪ 1(01)* ∪ 0(10)*1 ∪ 1(01)*0.

Transients represent waveforms in a natural way, as shown in Fig. 4.
Fig. 4. Transients as words for waveforms (examples shown: 1010, 10101, 010)



We use boldface symbols to denote transients, tuples of transients, and functions of
transients. For any transient t we denote by $\alpha(t)$ and $\omega(t)$ its first and last characters,
respectively. A transient can be obtained from any nonempty binary word by contraction,
i.e., the elimination of all duplicates immediately following a symbol (e.g., the
contraction of 00100011 is 0101). For a binary word s we denote by $\hat{s}$ the result of its
contraction. For any $t, t' \in T$, we denote by $tt'$ the concatenation of t and t'.
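
Contraction is easy to state in code. The following Python sketch represents binary words as strings over "0" and "1"; the function names are hypothetical.

```python
def contract(word):
    """Drop every symbol that repeats its immediate predecessor."""
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def is_transient(word):
    """A transient is a nonempty binary word that is its own contraction."""
    return word != "" and set(word) <= {"0", "1"} and contract(word) == word


print(contract("00100011"))  # the example from the text
```

The first and last characters $\alpha(t)$ and $\omega(t)$ are then simply `t[0]` and `t[-1]`.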

The prefix order on T is denoted $\leq$, and is extended to tuples. For $u = (u_1, \ldots, u_m)$
and $v = (v_1, \ldots, v_m)$ in $T^m$, we say that u is a prefix of v and write $u \leq v$, if $u_i \leq v_i$,
for all $i \in [m]$.

Extensions of Boolean functions to functions of transients are defined in [1]. Any
Boolean function $f : B^n \to B$ is extended to a function $\mathbf{f} : T^n \to T$ so that, for any
tuple $(t_1, \ldots, t_n)$ of transients, $\mathbf{f}$ produces the longest transient when $t_1, \ldots, t_n$ are
applied to the inputs of a gate performing the Boolean function f. We give an example
of an extended Boolean function next. For more details see [1].

Example 3. Let f be the two-input OR function and $\mathbf{f}$ its extension. Suppose we
want to compute $\mathbf{f}(01, 010)$. We construct a digraph D(01, 010) in which the nodes
consist of all the pairs (t, t') of transients such that $(t, t') \leq (01, 010)$, and there is an
edge between any two pairs p, p' only if $p \leq p'$, and p differs from p' in exactly one
coordinate by exactly one letter. The resulting graph is shown in Fig. 5(a). Also, for each
node (t, t') in the graph we consider as its label the value $f(\omega(t), \omega(t'))$. This results
in a graph of labels, shown in Fig. 5(b). The value of $\mathbf{f}(01, 010)$ is the contraction of
the label sequence of those paths in the graph of labels that have the largest number of
alternations between 0 and 1. Therefore, $\mathbf{f}(01, 010) = 0101$.

Let z(t) and u(t) denote the number of 0s and the number of 1s in a transient
t, respectively. We denote by $\otimes$ and $\oplus$ the extensions of the Boolean AND and OR
operations, respectively. It is shown in [1] that for any $w, w' \in T$ of length > 1,
$w \otimes w' = t$, where $t \in T$ is such that

$\alpha(t) = \alpha(w) \wedge \alpha(w')$, $\omega(t) = \omega(w) \wedge \omega(w')$, and $u(t) = u(w) + u(w') - 1$.

Similarly, $w \oplus w' = t$, where $t \in T$ is such that

$\alpha(t) = \alpha(w) \vee \alpha(w')$, $\omega(t) = \omega(w) \vee \omega(w')$, and $z(t) = z(w) + z(w') - 1$.

If one of the arguments is 0 or 1 the following rules apply:

$t \oplus 0 = 0 \oplus t = t$, $t \oplus 1 = 1 \oplus t = 1$,

$t \otimes 1 = 1 \otimes t = t$, $t \otimes 0 = 0 \otimes t = 0$.
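
As a sanity check, the closed formulas for the extended AND and OR can be compared against the graph-based definition used in Example 3. The Python sketch below represents transients as strings; `t_and`, `t_or`, and `by_graph` are hypothetical names, and `by_graph` is a straightforward dynamic program over the digraph D(w, w'), written for illustration only.

```python
def alternating(first, length):
    """Alternating binary word of the given length starting with `first`."""
    other = "1" if first == "0" else "0"
    return "".join(first if i % 2 == 0 else other for i in range(length))


def binary_op(c):
    """Extended OR for c == '0' (counts 0s), extended AND for c == '1' (counts 1s)."""
    other = "1" if c == "0" else "0"

    def op(w1, w2):
        if other in (w1, w2):          # absorbing constant
            return other
        if w1 == c:                    # neutral constant
            return w2
        if w2 == c:
            return w1
        first = c if w1[0] == c == w2[0] else other
        last = c if w1[-1] == c == w2[-1] else other
        count = w1.count(c) + w2.count(c) - 1
        length = 2 * count + 1 - (first == c) - (last == c)
        return alternating(first, length)

    return op


t_and, t_or = binary_op("1"), binary_op("0")


def by_graph(f, w1, w2):
    """Longest contracted label sequence over the paths of D(w1, w2)."""
    best = {}  # (i, j) -> (label of node, max number of label changes so far)
    for i in range(1, len(w1) + 1):
        for j in range(1, len(w2) + 1):
            label = f(w1[i - 1], w2[j - 1])
            preds = [p for p in ((i - 1, j), (i, j - 1)) if p in best]
            changes = max((best[p][1] + (best[p][0] != label) for p in preds),
                          default=0)
            best[i, j] = (label, changes)
    return alternating(f(w1[0], w2[0]), best[len(w1), len(w2)][1] + 1)


def bool_or(x, y):
    return "1" if "1" in (x, y) else "0"


def bool_and(x, y):
    return "1" if x == "1" == y else "0"


print(t_or("01", "010"), by_graph(bool_or, "01", "010"))
```

Both computations give 0101 for the arguments of Example 3.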
Fig. 5. Graph D(01, 010) with labels: (a) nodes (0, 0), (01, 0), (0, 01), (01, 01), (0, 010), (01, 010); (b) the corresponding labels 0, 1, 1, 1, 0, 1



The complement $\overline{t}$ of $t \in T$ is obtained by complementing each character of t. For
example, $\overline{1010} = 0101$.

The algebra $C = (T, \oplus, \otimes, \overline{\phantom{t}}, 0, 1)$ is called the change-counting algebra, and is a
commutative de Morgan bisemigroup [1]. We also refer to C as the algebra of transients.

We denote by $t \circ t'$ concatenation followed by contraction, i.e., $t \circ t' = \widehat{tt'}$. The $\circ$
operation is associative, and also satisfies, for $t, t', t_1, \ldots, t_n \in T$ and $b \in \{0, 1\}$: 1. if
$t \leq t'$ then $b \circ t \leq b \circ t'$; and 2. $t_1 \circ \ldots \circ t_n = \widehat{t_1 \ldots t_n}$.

5 Simulation with Algebra C 

A simulation algorithm using algebra C has been proposed in [1]; it generalizes ternary
simulation [3,5]. We now give a more general version of the simulation algorithm,
and show how it relates to the original version. This parallels the extension of ternary
simulation from stable initial state to any initial state [3].

Given any circuit, we use two networks: a binary network $N = (\{0, 1\}, \mathcal{X}, \mathcal{S}, \mathcal{E})$
and the transient network $\mathbf{N} = (T, \mathcal{X}, \mathcal{S}, \mathcal{E})$ having set T of transients as the domain.
The two networks have the same input and state variables, but these variables take values
from different domains. A state of network $\mathbf{N}$ is a tuple of transients; the value of the
excitation of a variable is also a transient. Excitations in $\mathbf{N}$ are the extensions to C of the
Boolean excitations in N. It is shown in [1] that an extended Boolean function depends
on one of its arguments if and only if the corresponding Boolean function depends on
that argument. Therefore N and $\mathbf{N}$ have the same set of edges.

Binary variables, words, tuples and excitations in N are denoted by italic characters
(e.g., $s$, $S$). Transients, tuples of transients, and excitations in $\mathbf{N}$ are denoted by boldface
characters (e.g., $\mathbf{s}$, $\mathbf{S}$). We refer to components of a tuple by subscripts (e.g., $s_i$, $\mathbf{s}_i$).

5.1 General Simulation: Algorithm A 

We want to record in the value of a variable all the changes in that variable since the start 
of the simulation, as dictated by its excitation. For variables that are stable initially, since 
the initial state agrees with the initial excitation, the state transient and the excitation 
transient will be the same, so at each step we just copy the excitation into the variable. 
For example, with initial state 0 and excitation 0, if the excitation becomes 01, we set 
the variable to 01, and so on. For variables that are initially unstable, we first record the 
initial state, and then the excitation. The operator that gives us the desired result in both
cases is $\circ$; thus we have new_value = initial_value $\circ$ excitation.
Fig. 6. Circuit with finite simulation

Let $a \cdot b$ be a (binary) total state of N. Algorithm A is defined as follows:

Algorithm A

    s^0 := b;
    h := 1;
    s^h := b ∘ S(a · s^{h-1});
    while (s^h <> s^{h-1}) do
        h := h + 1;
        s^h := b ∘ S(a · s^{h-1});

where $\circ$ is applied to tuples component-wise, i.e., for all m-tuples u, v of transients,
$u \circ v = w$, where w is such that $w_i = u_i \circ v_i$, for all $i \in [m]$.

Algorithm A produces a sequence $\mathbf{s}^0, \mathbf{s}^1, \ldots, \mathbf{s}^h, \ldots$, where
$\mathbf{s}^h = (\mathbf{s}_1^h, \mathbf{s}_2^h, \ldots, \mathbf{s}_m^h) \in T^m$, for all $h \geq 0$. This sequence can be finite, if we reach
$\mathbf{s}^{h_0} = \mathbf{s}^{h_0 - 1}$ for some $h_0 > 0$, or infinite otherwise. For convenience, we sometimes
consider the finite sequences as being infinite, with $\mathbf{s}^h = \mathbf{s}^{h_0}$ for all $h > h_0$.

It is shown in [1] that any extended Boolean function $\mathbf{f} : T^m \to T$ is monotonic
with respect to the prefix order, i.e., for any $x, y \in T^m$, if $x \leq y$, then $\mathbf{f}(x) \leq \mathbf{f}(y)$.

Proposition 1. The sequence resulting from Algorithm A is nondecreasing or monotonic
with respect to the prefix order, that is, for all $h \geq 0$, $\mathbf{s}^h \leq \mathbf{s}^{h+1}$.

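Proof: Since extended Boolean functions are monotonic with respect to the prefix order,
so are excitations. We proceed by induction on h.

Basis, h = 0: $\mathbf{s}^0 = b \leq b \circ \mathbf{S}(a \cdot \mathbf{s}^0) = \mathbf{s}^1$.

Induction step: $\mathbf{s}^h = b \circ \mathbf{S}(a \cdot \mathbf{s}^{h-1}) \leq b \circ \mathbf{S}(a \cdot \mathbf{s}^h) = \mathbf{s}^{h+1}$. ■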

For feedback-free circuits, the sequence resulting from Algorithm A is finite. We can
see this if we order the state variables by levels as follows. Level 1 consists of all state
variables which depend only on external inputs. Level l consists of all state variables
which depend only on variables of level < l, and on at least one variable of level l − 1.
Since the inputs do not change during simulation, level-1 variables change at most once,
in the first step of Algorithm A. In general, level-l variables change at most l times.
Since the number of levels is finite, our claim follows. Thus the running time of A for
feedback-free circuits is polynomial in the number of state variables.

For display reasons, in examples of simulation we write binary states as words, but 
during computations they are regarded as tuples. 

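Example 4. Consider the feedback-free circuit in Fig. 6. The excitations are: $\mathbf{S}_1 = \overline{X_2}$,
$\mathbf{S}_2 = X_1 \otimes \mathbf{s}_1$, $\mathbf{S}_3 = \overline{\mathbf{s}_2}$, $\mathbf{S}_4 = \mathbf{s}_2 \oplus \mathbf{s}_3$. For the initial state $a \cdot b = 11 \cdot 1011$,
Algorithm A results in Table 1 (left).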
Table 1. Results of Algorithms A and Ã

Table 2. Infinite simulation

    X_1   s_1   s_2   s_3   state
    0     0     0     0     s^0
    0     01    0     01    s^1
    0     01    01    01    s^2
    0     01    01    010   s^3
    ...   ...   ...   ...   ...

Example 5. For circuits with feedback the simulation sequence may be infinite. Consider
the circuit with feedback in Fig. 1. The excitation functions are: $\mathbf{S}_1 = \overline{X_1}$, $\mathbf{S}_2 =
\mathbf{s}_1 \otimes \mathbf{s}_3$, $\mathbf{S}_3 = \overline{\mathbf{s}_2}$. We run Algorithm A for this network started in state $a \cdot b = 0 \cdot 000$;
the resulting sequence of states, which is infinite, is illustrated in Table 2.
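
Algorithm A can be prototyped directly from its definition. The Python sketch below is a minimal illustration, not the authors' implementation: transients are strings, the extended AND and OR use the closed formulas of Section 4, the excitations of Examples 4 and 5 are written with the complements as reconstructed above, and all function names (including the `max_steps` cut-off, needed because the sequence may be infinite) are hypothetical.

```python
def contract(word):
    """Drop every symbol that repeats its immediate predecessor."""
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def alternating(first, length):
    other = "1" if first == "0" else "0"
    return "".join(first if i % 2 == 0 else other for i in range(length))


def binary_op(c):
    """Extended OR for c == '0', extended AND for c == '1' (closed formulas)."""
    other = "1" if c == "0" else "0"

    def op(w1, w2):
        if other in (w1, w2):
            return other
        if w1 == c:
            return w2
        if w2 == c:
            return w1
        first = c if w1[0] == c == w2[0] else other
        last = c if w1[-1] == c == w2[-1] else other
        count = w1.count(c) + w2.count(c) - 1
        return alternating(first, 2 * count + 1 - (first == c) - (last == c))

    return op


t_and, t_or = binary_op("1"), binary_op("0")


def t_not(w):
    return "".join("1" if ch == "0" else "0" for ch in w)


def algorithm_a(excite, a, b, max_steps=50):
    """Return [s^0, s^1, ...]; stop at a fixed point or after max_steps steps."""
    states = [tuple(b)]
    for _ in range(max_steps):
        exc = excite(a, states[-1])
        # s^h := b o S(a . s^{h-1}), applied component-wise
        states.append(tuple(contract(bi + ei) for bi, ei in zip(b, exc)))
        if states[-1] == states[-2]:
            break
    return states


def example4(a, s):   # feedback-free circuit of Fig. 6 (assumed excitations)
    return (t_not(a[1]), t_and(a[0], s[0]), t_not(s[1]), t_or(s[1], s[2]))


def example5(a, s):   # circuit with feedback of Fig. 1 (assumed excitations)
    return (t_not(a[0]), t_and(s[0], s[2]), t_not(s[1]))


run4 = algorithm_a(example4, ("1", "1"), ("1", "0", "1", "1"))
run5 = algorithm_a(example5, ("0",), ("0", "0", "0"), max_steps=3)
print(run4[-1])
print(run5)
```

For Example 5 the first four states reproduce the rows of Table 2; for Example 4 the run reaches a fixed point, as expected for a feedback-free circuit.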

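5.2 Simulation with Stable Initial State: Algorithm Ã

Algorithm A above makes no assumptions about the starting state $a \cdot b$. If the network starts
in a stable total state and the inputs change, then we have a slightly simpler formulation
which we call Algorithm Ã; this is the version used in [1]. Assume N is started in stable
total state $\tilde{a} \cdot b$ and the input tuple changes to a.

Algorithm Ã

    a = ã ∘ a;
    s̃^0 := b;
    h := 1;
    s̃^h := S(a · s̃^{h-1});
    while (s̃^h <> s̃^{h-1}) do
        h := h + 1;
        s̃^h := S(a · s̃^{h-1});

Here the left-hand side of the first assignment is the tuple of input transients $\mathbf{a} = \tilde{a} \circ a$, which is the input used in all the excitations of Algorithm Ã.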



Example 6. We illustrate Algorithm Ã with the network in Fig. 6 started in stable state
$\tilde{a} \cdot b = 11 \cdot 0011$, with the input changing to a = 10. The result is shown in Table 1 (right).

It is shown in [1] that the sequence of states resulting from Algorithm Ã is
nondecreasing with respect to the prefix order, i.e., Algorithm Ã is monotonic.

For our next result, we modify the circuit model slightly. For each input Xi we add a 
delay, called input gate, with output Si and excitation Si = Xi. This follows the model 
of US- The following shows that Algorithms A and Ã are equivalent for any network N
started in a stable state, provided that N contains input-gate variables. 

Theorem 1. Let N be a network containing input-gate variables. Let $\tilde{\mathbf{s}}^0, \tilde{\mathbf{s}}^1, \ldots, \tilde{\mathbf{s}}^h, \ldots$
be the sequence of states produced by Algorithm Ã for N started in the stable (binary)
total state $\tilde{a} \cdot b$ with the input tuple changing to a. Then, for all $h \geq 0$, $\tilde{\mathbf{s}}^h = \mathbf{s}^h$, where
$\mathbf{s}^0, \mathbf{s}^1, \ldots, \mathbf{s}^h, \ldots$ is the sequence of states produced by Algorithm A for N started in
total state $a \cdot b$.

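Proof: We prove the theorem by induction on h.

Basis, h = 0. Since $\tilde{\mathbf{s}}^0 = b = \mathbf{s}^0$, the basis holds.

First step, h = 1. In states $\tilde{\mathbf{s}}^0$ and $\mathbf{s}^0$ only input-gate variables can be unstable; therefore
only they can change in the first step of Ã, and of A. One easily verifies that $\tilde{\mathbf{s}}^1 = \mathbf{s}^1$.

Induction step. For any $i \in [m]$, if $s_i$ is an input-gate variable then $\tilde{\mathbf{s}}_i^{h+1} = \tilde{\mathbf{s}}_i^h$ and
$\mathbf{s}_i^{h+1} = \mathbf{s}_i^h$, because in both algorithms the input-gate variables do not change after
the first step. By the induction hypothesis, we have $\tilde{\mathbf{s}}_i^{h+1} = \mathbf{s}_i^{h+1}$. If $s_i$ is not an input-gate
variable, then it is initially stable in both algorithms, and its excitation does not depend
on the input tuple, i.e., $\mathbf{S}_i(\mathbf{a} \cdot x) = \mathbf{S}_i(a \cdot x)$, for any (internal) state tuple x. Then
$\tilde{\mathbf{s}}_i^{h+1} = \mathbf{S}_i(\mathbf{a} \cdot \tilde{\mathbf{s}}^h) = \mathbf{S}_i(a \cdot \mathbf{s}^h) = \mathbf{s}_i^{h+1}$. Hence $\tilde{\mathbf{s}}_i^{h+1} = \mathbf{s}_i^{h+1}$, for all $i \in [m]$. ■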

6 Covering of Binary Analysis by Simulation 

Given the two networks N and $\mathbf{N}$ modeling a gate circuit, we perform the binary analysis
for N and Algorithm A for $\mathbf{N}$, both with the same starting total state $a \cdot b$. The binary
analysis results in graph $G_a(b)$. Let the state sequence resulting from Algorithm A be
$\mathbf{s}^0, \mathbf{s}^1, \ldots, \mathbf{s}^h, \ldots$, where $\mathbf{s}^h = (\mathbf{s}_1^h, \mathbf{s}_2^h, \ldots, \mathbf{s}_m^h) \in T^m$, for all $h \geq 0$.

We now show that binary analysis is covered by Algorithm A. Take any path from
the initial state b in graph $G_a(b)$. Suppose the length of the path is h. For each state
variable $s_i$ we consider the transient that shows the changes of that variable along the
path. We show that this transient is a prefix of the value that variable $\mathbf{s}_i$ takes in the
h-th iteration of Algorithm A.

Example 7. Consider the binary counterpart of the transient network in Fig. 6 with
$S_1 = \overline{X_2}$, $S_2 = X_1 \wedge s_1$, $S_3 = \overline{s_2}$, $S_4 = s_2 \vee s_3$. In $G_{11}(1011)$, with the same
initial total state as in Example 4, we find a path $\pi$ = 1011, 1111, 0111, 0001 of length
h = 3. If we follow state variable $s_3$, for example, it changes from 1 to 0 along this
path, so the corresponding transient is 10. The value of $\mathbf{s}_3$ in the third step of Algorithm
A is $\mathbf{s}_3^3 = 101$, which has 10 as a prefix. In fact, this holds for all variables, since
$(10, 010, 10, 1) \leq (10, 010, 101, 1010)$.

Definition 2. Let $\pi = s^0, \ldots, s^h$ be a path of length $h \geq 0$ in $G_a(b)$. Recall that each
$s^j$ is a tuple $(s_1^j, \ldots, s_m^j)$. For any $i \in [m]$, we denote by $\sigma_i^{\pi}$ the transient $\widehat{s_i^0 \ldots s_i^h}$,
which shows the changes of the i-th state variable along path $\pi$. We refer to it as the
history of variable $s_i$ along the path. We define $\Sigma_i^{\pi}$ to be $\hat{E}_i$, where $E_i = S_i(a \cdot s^0) S_i(a \cdot s^1) \ldots S_i(a \cdot s^h)$,
and we call it the excitation history of variable $s_i$ along path $\pi$. The
histories of all variables along $\pi$ constitute tuple $\sigma^{\pi} = (\sigma_1^{\pi}, \ldots, \sigma_m^{\pi})$. The histories of
all excitations along $\pi$ form tuple $\Sigma^{\pi} = (\Sigma_1^{\pi}, \ldots, \Sigma_m^{\pi})$.

Note that $\sigma_i^{\pi}$ and $\Sigma_i^{\pi}$ are not always the same. If $s_i$ is unstable initially, they are obviously
different, since their first characters are different, that is $s_i^0 \neq S_i(a \cdot s^0)$. Even if the
variable is stable initially, $\sigma_i^{\pi}$ and $\Sigma_i^{\pi}$ can still be different.

Example 8. An example of a path in graph $G_{11}(1011)$ of the previous example, on which
a variable changes fewer times than its excitation, is path $\pi$ = 1011, 0111, 0011, where
$\sigma_3^{\pi} = 1$, whereas $\Sigma_3^{\pi} = 101$.

Let $\mathbf{s}^h$ be the state produced by Algorithm A after h steps, and let $\pi = s^0, \ldots, s^h$
be a path of length $h \geq 0$ in $G_a(b)$, with $s^0 = b$. We prove that $\sigma^{\pi} \leq \mathbf{s}^h$.

Proposition 2. Let $\pi = s^0, \ldots, s^h, s^{h+1}$ be a path in $G_a(b)$, and let $\pi' = s^0, \ldots, s^h$.
Then, $\sigma^{\pi} \leq \sigma^{\pi'} \circ S(a \cdot s^h)$.

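Proof: For any variable $s_i$ we have one of the following cases.

Case I, $s_i$ changes during the transition from $s^h$ to $s^{h+1}$. Then $s_i$ must be unstable in
state $s^h$, i.e., $S_i(a \cdot s^h) \neq s_i^h$, and $s_i^{h+1} = S_i(a \cdot s^h)$, by the definition of binary analysis.
Hence $\sigma_i^{\pi} = \widehat{s_i^0 \ldots s_i^h} \circ s_i^{h+1} = \sigma_i^{\pi'} \circ S_i(a \cdot s^h)$.

Case II, $s_i$ does not change during the transition from $s^h$ to $s^{h+1}$. Then $s_i^{h+1} = s_i^h$
by the definition of binary analysis. Then $\sigma_i^{\pi} = \widehat{s_i^0 \ldots s_i^h s_i^{h+1}} = \widehat{s_i^0 \ldots s_i^h} = \sigma_i^{\pi'} \leq
\sigma_i^{\pi'} \circ S_i(a \cdot s^h)$. Thus, our claim holds. ■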

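Corollary 1. For any path $\pi = s^0, \ldots, s^h, s^{h+1}$ in $G_a(b)$, with $\pi' = s^0, \ldots, s^h$, we
have $\sigma^{\pi} \leq s^0 \circ \Sigma^{\pi'}$.

Proof: $\sigma^{\pi} \leq \sigma^{\pi'} \circ S(a \cdot s^h) \leq (\ldots((s^0 \circ S(a \cdot s^0)) \circ S(a \cdot s^1)) \circ \ldots) \circ S(a \cdot s^h) =
s^0 \circ (S(a \cdot s^0) \circ S(a \cdot s^1) \circ \ldots \circ S(a \cdot s^h)) = s^0 \circ \Sigma^{\pi'}$. ■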

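Proposition 3. For any path $\pi = s^0, \ldots, s^h$ in $G_a(b)$, $\Sigma^{\pi} \leq \mathbf{S}(a \cdot \sigma^{\pi})$.

Proof: Let $\pi_j = s^0, \ldots, s^j$, for all j such that $0 \leq j \leq h$. Then $\sigma^{\pi_0} \leq \sigma^{\pi_1} \leq \ldots \leq \sigma^{\pi_h} = \sigma^{\pi}$.
Thus $a \cdot \sigma^{\pi_0} \leq a \cdot \sigma^{\pi_1} \leq \ldots \leq a \cdot \sigma^{\pi}$, which means that $a \cdot \sigma^{\pi_0}, a \cdot \sigma^{\pi_1}, \ldots, a \cdot \sigma^{\pi}$
is a subsequence q of nodes on a path p from $a \cdot \alpha(\sigma_1^{\pi}) \ldots \alpha(\sigma_m^{\pi}) = a \cdot s^0 = a \cdot \sigma^{\pi_0}$
to $a \cdot \sigma^{\pi}$ in the graph $D(a \cdot \sigma^{\pi})$. For any $i \in [m]$, we consider the labeling of graph
$D(a \cdot \sigma^{\pi})$ with Boolean excitation $S_i$. Let $\Lambda$ be the sequence of labels of p. The sequence
of labels on q is $E_i = S_i(a \cdot s^0), S_i(a \cdot s^1), \ldots, S_i(a \cdot s^h)$. Since q is a subsequence of
p, $\hat{E}_i \leq \hat{\Lambda}$. By the definition of extended Boolean functions, $\mathbf{S}_i(a \cdot \sigma^{\pi})$ is the longest
transient obtained by the contraction of the label sequences of paths from $a \cdot \sigma^{\pi_0}$ to
$a \cdot \sigma^{\pi}$ in graph $D(a \cdot \sigma^{\pi})$. Hence $\hat{\Lambda} \leq \mathbf{S}_i(a \cdot \sigma^{\pi})$. By the definition of the excitation
history, $\Sigma_i^{\pi} = \hat{E}_i$. It follows that $\Sigma_i^{\pi} \leq \mathbf{S}_i(a \cdot \sigma^{\pi})$. ■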

Theorem 2. For all paths $\pi = s^0, \ldots, s^h$ in $G_a(b)$, with $s^0 = b$, $\sigma^{\pi} \leq \mathbf{s}^h$, where $\mathbf{s}^h$ is
the (h + 1)st state in the sequence resulting from Algorithm A.

Proof: We prove the theorem by induction on $h \geq 0$.

Basis, h = 0. We have $\pi = s^0 = b = \mathbf{s}^0$; hence $\sigma^{\pi} = s^0 = \mathbf{s}^0$, so the claim holds.

Induction hypothesis. The claim holds for some $h \geq 0$, i.e., for all paths $\pi$ of length h
from b in $G_a(b)$, we have $\sigma^{\pi} \leq \mathbf{s}^h$.

Induction step. Let $\gamma = s^0, \ldots, s^h, s^{h+1}$ be a path of length h + 1 from b in $G_a(b)$.
Then $\pi = s^0, \ldots, s^h$ is a path of length h, and we have

    $\sigma^{\gamma} \leq s^0 \circ \Sigma^{\pi}$                      { Cor. 1 }
    $\leq b \circ \mathbf{S}(a \cdot \sigma^{\pi})$                    { $s^0 = b$ and Prop. 3 }
    $\leq b \circ \mathbf{S}(a \cdot \mathbf{s}^h)$                    { induction hypothesis, monotonicity of excitations, and property of $\circ$ }
    $= \mathbf{s}^{h+1}$                                               { definition of Algorithm A }.


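Corollary 2. If Algorithm A terminates with state $\mathbf{s}^H$, then for any path $\pi$ from b in
$G_a(b)$, $\sigma^{\pi} \leq \mathbf{s}^H$.

Proof: Suppose there exists a path $\pi$ from b in $G_a(b)$ that satisfies $\sigma_i^{\pi} > \mathbf{s}_i^H$, for some
$i \in [m]$. Let h be the length of $\pi$. If $h \leq H$, Theorem 2 shows that $\sigma^{\pi} \leq \mathbf{s}^h$. We also
have $\mathbf{s}^h \leq \mathbf{s}^H$, by Prop. 1. So $\sigma^{\pi} \leq \mathbf{s}^H$, and in particular $\sigma_i^{\pi} \leq \mathbf{s}_i^H$, which contradicts
our supposition. If h > H, then Theorem 2 states that $\sigma^{\pi} \leq \mathbf{s}^h$. By our convention,
$\mathbf{s}^h = \mathbf{s}^H$. So, again we have $\sigma_i^{\pi} \leq \mathbf{s}_i^H$, which is a contradiction. ■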



7 Conclusions 

We have proved that all the changes that occur in binary analysis are also detected by 
simulation. In a companion paper (SI we prove a partial converse of this result. 
Abstract. Finite Deterministic Cover Automata (DFCA) can be obtained
from Deterministic Finite Automata (DFA) using the similarity
relation. Since the similarity relation is not an equivalence relation, the
minimal DFCA for a finite language is usually not unique. We count
the number of minimal DFCA that can be obtained from a given minimal
DFA with n states by merging the similar states in the given DFA.
We compute an upper bound for this number and prove that in the
worst case (for a non-unary alphabet) it is
$\frac{k_0!}{(2k_0 - n + 1)!}$, where $k_0 = \left\lceil \frac{4n - 9 + \sqrt{8n+1}}{8} \right\rceil$.
We prove that this upper bound is reached, i.e. for any given positive
integer n we find a minimal DFA with n states, which has the number
of minimal DFCA obtained by merging similar states equal to this
maximum.



1 Introduction 

Finite languages have many practical applications, for example lexical analysis 
(e.g. in programming language compilation) and user interface translations |1 01 
E]. However, the finite languages used in applications are generally very large, 
which need thousands or even millions of states if represented by deterministic 
finite automata (DFA) or similar structures. In PQ, deterministic finite cover 
automata (DFCA) are introduced as an alternative representation of finite lan- 
guages. Experiments have shown that, in many cases, DFCA are much smaller 
in size than their corresponding minimal DFA |B]. 

Let L be a finite language and l the length of the longest word(s) in L.
Intuitively, a DFCA A for L is a DFA that accepts all words in L and possibly
additional words of length greater than l. So, a word w is in L if and only if it
is accepted by A (as a DFA), and its length is less than or equal to l; in other
words, a DFCA will accept a language that "covers" the initial language. Note
that checking the length of a word is usually not an extra burden in practice,
since the length of an input word is kept anyway in most applications.

The level of a state is the length of the shortest path from the initial state
to that state. If two states p and q are similar (as defined in Definition 5), then
one "can get rid of" one of the two states using the steps described in Theorem 1;
more precisely, if two states p and q are similar and the level of state p is less than
or equal to the level of state q, then we merge state q into p, i.e. we change all
incoming transitions to state q into transitions to state p.

This theorem is actually the basis for the minimization algorithms presented 
in and [2 (these algorithms construct a minimal DFCA from a given DFA 
that accepts a finite language). The algorithms are based on computing the 
similarity relation between states and merging these states. 

One can notice that we can have more than one minimal DFCA for a given
language L, even if, obviously, the number of states is the same (see Figure 1);
so, by relaxing the constraints of such an automaton for a finite language, we
pay the price of losing the uniqueness of the minimal element.





Fig. 1. Two distinct minimal DFCA for L = {a, ab, ba, aba, abb, baa, bab} 



A natural question may now arise: “How many distinct automata can an 
algorithm that is merging states yield?” 

We answer this question by proving that for an alphabet with more than
one letter, this maximum is $\frac{k_0!}{(2k_0 - n + 1)!}$, where n is the number of states in the
minimal DFA and $k_0 = \left\lceil \frac{4n - 9 + \sqrt{8n+1}}{8} \right\rceil$, and that this upper bound is reached (Theorem 0).

In the next section, we give the basic definitions and notation, as well as 
some of the basic results on cover languages and automata. 

2 Preliminaries 

We assume the reader to be familiar with the basic notations of formal languages 
and with finite automata, cf. e.g. 0 , 0 , 0 . 
In the following we will consider that $L \subseteq \Sigma^*$ is a finite language over an
alphabet $\Sigma$ and l the length of the longest word(s) in L. The cardinality of a
finite set A is #A, the set of words over a finite alphabet $\Sigma$ is denoted $\Sigma^*$, and
the empty word is $\lambda$. The length of a word $w \in \Sigma^*$ is denoted |w|. The set of
words over $\Sigma$ of length at least (respectively, at most, equal) n is denoted $\Sigma^{\geq n}$
(respectively, $\Sigma^{\leq n}$, $\Sigma^{n}$).

For a DFA $A = (\Sigma, Q, q_0, \delta, F)$, we can assume without loss of generality
that Q = {0, 1, ..., n − 1}, $n \geq 2$ and $q_0 = 0$.

Please, refer to [I] for the following definitions and results. 

Definition 1. A language L' over $\Sigma$ is called a cover language of L if $L' \cap \Sigma^{\leq l} = L$.



Definition 2. A deterministic finite cover automaton (DFCA) for L is a deter- 
ministic finite automaton (DFA) A such that the language accepted by A, i.e., 
L{A), is a cover language of L. 



Definition 3. Let $x, y \in \Sigma^*$. We define the similarity relation on words:

$x \sim_L y$ if for all $z \in \Sigma^*$ such that $xz, yz \in \Sigma^{\leq l}$, $xz \in L$ iff $yz \in L$,
and we write $x \nsim_L y$ if $x \sim_L y$ does not hold.
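
The similarity relation on words can be checked by brute force for a small finite language. The Python sketch below is only an illustration of Definition 3 (the function name `similar` is hypothetical); it uses the language L = {a, ab, ba, aba, abb, baa, bab} of Fig. 1 and shows why similarity is not an equivalence relation.

```python
from itertools import product


def similar(x, y, language, alphabet):
    """x ~_L y: xz in L iff yz in L for every z with |xz| <= l and |yz| <= l."""
    longest = max(map(len, language))
    room = longest - max(len(x), len(y))
    for k in range(room + 1):           # empty range if no suitable z exists
        for letters in product(alphabet, repeat=k):
            z = "".join(letters)
            if (x + z in language) != (y + z in language):
                return False
    return True


L = {"a", "ab", "ba", "aba", "abb", "baa", "bab"}

print(similar("aba", "a", L, "ab"))   # aba ~ a
print(similar("aba", "ab", L, "ab"))  # aba ~ ab
print(similar("a", "ab", L, "ab"))    # but a and ab are not similar
```

Here aba is similar to both a and ab, yet a and ab are not similar (take z = a: aa is not in L while aba is), so the relation is not transitive.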

Definition 4. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA (or DFCA). For each state
$q \in Q$, we define $level(q) = \min\{|w| \mid \delta(0, w) = q\}$, and out of the minimal w,
we denote by $x_A(q)$ the smallest in the lexicographic order.

(level(q) is the length of the shortest path, $x_A(q)$, (in the directed graph associated
with the automaton) from the initial state to q.)

When the automaton A is understood, we write $x_q$ instead of $x_A(q)$. The
length of $x_q$ is equal to level(q); level(q) is defined for each $q \in Q$, since we
consider only connected/minimal DFA.

Definition 5. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA for L. Let $p, q \in Q$ and m =
max{level(p), level(q)}. We say that $p \sim_A q$ (the states p and q are similar) if
for every $w \in \Sigma^{\leq l - m}$, $\delta(p, w) \in F$ iff $\delta(q, w) \in F$.

The relations $\sim_L$ and $\sim_A$ are called similarity relations for words and
for states, respectively.

Lemma 1. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of L. Let $s, p, q \in Q$ such that
$level(s) \leq level(p) \leq level(q)$. The following statements hold:

1. If $s \sim_A p$, $s \sim_A q$, then $p \sim_A q$.

2. If $s \sim_A p$, $p \sim_A q$, then $s \sim_A q$.

3. If $s \sim_A p$, $p \nsim_A q$, then $s \nsim_A q$.

4. If $s \sim_A p$, $s \nsim_A q$, then $p \nsim_A q$.
Theorem 1. (Merging theorem) Let $A = (Q, \Sigma, \delta, s, F)$ be a DFCA of $L$.
Suppose that for $p, q \in Q$, $p \sim_A q$, $p \ne q$ and $level(p) < level(q)$.
Then we can construct a DFCA, $A' = (Q', \Sigma, \delta', s, F')$ for $L$, such
that $Q' = Q - \{q\}$, $F' = F - \{q\}$, and

$$\delta'(t,a) = \begin{cases} \delta(t,a) & \text{if } \delta(t,a) \ne q, \\ p & \text{if } \delta(t,a) = q \end{cases}$$

for each $t \in Q'$ and $a \in \Sigma$.

We say that state $q$ is merged into state $p$ if we can apply the above theorem
for $p$ and $q$.
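The construction in the merging theorem amounts to deleting $q$ and redirecting every transition that entered $q$ to $p$. A minimal sketch of this step follows, assuming the same dictionary encoding of a DFA as an illustration; the helper names `merge` and `accepts` and the example language are hypothetical.

```python
def merge(delta, finals, p, q):
    """Merge state q into state p: drop q and redirect every transition into q to p."""
    new_delta = {
        t: {a: (p if target == q else target) for a, target in row.items()}
        for t, row in delta.items()
        if t != q
    }
    return new_delta, finals - {q}

def accepts(delta, finals, word, start=0):
    """Run the automaton on word and report whether it ends in a final state."""
    state = start
    for a in word:
        state = delta[state][a]
    return state in finals

# Hypothetical example: minimal DFA of L = {a, aa} (l = 2), sink state 3.
DELTA = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3}}
FINALS = {1, 2}

# States 1 and 2 are similar and level(1) < level(2), so 2 is merged into 1.
COVER_DELTA, COVER_FINALS = merge(DELTA, FINALS, p=1, q=2)
COVERED = {"a" * n for n in range(3) if accepts(COVER_DELTA, COVER_FINALS, "a" * n)}
```

The resulting automaton accepts every nonempty power of a, but restricted to words of length at most 2 it accepts exactly $L$, so it is a cover automaton for $L$.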

Corollary 1. A DFCA A is a minimal DFCA for L if and only if no two distinct 
states of A are similar. 

We denote by $\equiv_A$ the equivalence relation on states of a DFA $A$.

Lemma 2. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of $L$. Then $p \equiv_A q$
implies $p \sim_A q$.

Theorem 2. Any minimal DFCA of L has the same number of states. 



3 Counting the Number of Minimal DFCA Obtained by 
Merging States from the Minimal DFA 

The following lemma will be useful in the subsequent results: 

Lemma 3. Let $A = (Q, \Sigma, \delta, 0, F)$ be a minimal DFA of a finite language
$L$. If two distinct states have the same level, they are not similar.

Proof. Assume that $level(p) = level(q)$ and $p \sim_A q$; therefore
$\delta(p,w) \in F$ iff $\delta(q,w) \in F$, for all $w \in \Sigma^{\le l - level(p)}$.

Since $\delta(p,w) \notin F$ and $\delta(q,w) \notin F$ for all
$w \in \Sigma^{> l - level(p)}$, it follows that $p \equiv_A q$, which
contradicts the minimality of $A$. □



For $A = (\Sigma, Q, 0, \delta, F)$ a minimal DFA accepting the finite language $L$,
we denote $Sim_s = \{q \in Q \mid s \sim_A q,\ level(s) < level(q)\}$ for all
$s \in Q$. We can always obtain a minimal DFCA for $L$ by merging any state $s$
into any state $p$ if $s \in Sim_p$. Denote by
$Merged = \{s \in Q \mid s \in Sim_q,\ q \ne s\}$ the merged states, and by
$non\text{-}Merged = Q - Merged$ the states that cannot be merged into other
states. For any DFCA $C = (\Sigma, non\text{-}Merged, 0, \delta_C, F_C)$ obtained
by merging states, the transitions between any two states that cannot be merged
are unchanged, so $\delta_C(p,a) = q$ iff $\delta(p,a) = q$ for all
$p, q \in non\text{-}Merged$ and $a \in \Sigma$. We use the name skeleton for
these states and transitions. Any minimal DFCA obtained by merging similar states
will have the same skeleton.
Lemma 4. With $A$, $Sim_s$, $Merged$ and the skeleton defined in the previous
paragraph, we have: for any state $q \in non\text{-}Merged - \{0\}$, there is at
least one state $p \in non\text{-}Merged$ such that $\delta_C(p,a) = q$, for some
$a \in \Sigma$.

Proof. Assume that there exists a skeleton state $q$ that is reachable only from
merging states. Since $x_q$ is the shortest string with $\delta(0,x_q) = q$,
consider the state $s \in Merged$ and a word $w$, with $\delta(0,w) = s$,
$\delta(s,a) = q$, and $wa = x_q$. It is obvious that $level(s) = |w| < level(q)$.
Since $s$ can be merged (because $s \in Merged$) with some state $p$ such that
$level(p) < level(s)$ in the DFCA, we have $\delta_C(0,x_q) = \delta_C(p,a)$. So
$x_q y \in L$ iff $\delta(p,ay) \in F_C$, for all $y \in \Sigma^{\le l - |x_q|}$,
i.e. $q \sim_A \delta(p,a)$. Since $level(\delta(p,a)) < level(q)$, and
$q \in non\text{-}Merged - \{0\}$, it follows that $\delta_C(p,a) = q$. □



Corollary 2. Any two distinct DFCA obtained by merging similar states are 
not isomorphic. 

Proof. Since all DFCA contain the same skeleton, the only transitions that are 
changed in the minimal DFA are those from skeleton states to merged states. 
Therefore, different transitions between the same states of the skeleton will pro- 
duce different languages, otherwise we would have two distinct minimal DFA 
for the same language, since any minimal DFCA is also a minimal DFA for the 
corresponding cover language. □ 

Therefore, the maximal number of minimal DFCA that can be obtained by
merging states is

$$\prod_{p \in Merged} \#\{s \in non\text{-}Merged \mid p \in Sim_s\}. \qquad (1)$$
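Formula (1) is a product, over the merged states, of the number of skeleton states each of them can be merged into. The sketch below evaluates it for explicitly given similarity sets; the function name, the state labels, and the sets themselves are hypothetical and chosen only to illustrate the computation.

```python
from math import prod

def max_dfca_by_merging(sim, merged):
    """Evaluate formula (1): product over merged p of #{s not merged : p in Sim_s}."""
    non_merged = set(sim) - set(merged)
    return prod(sum(1 for s in non_merged if p in sim[s]) for p in merged)

# Hypothetical Sim sets: skeleton states 0..4 (one per level) and merged states
# "m3", "m4", "m5" on levels 3, 4, 5, each similar to every lower-level state.
SIM = {
    0: {"m3", "m4", "m5"},
    1: {"m3", "m4", "m5"},
    2: {"m3", "m4", "m5"},
    3: {"m4", "m5"},
    4: {"m5"},
    "m3": {"m4", "m5"},
    "m4": {"m5"},
    "m5": set(),
}
MERGED = ["m3", "m4", "m5"]
COUNT = max_dfca_by_merging(SIM, MERGED)
```

With these sets the three factors are 3, 4 and 5, so the product is 60.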



Lemma 5. In the worst case, each state that is merged must have at least an 
incoming transition from a state that is not merged. 

Proof. Assume that there is a state $q$ such that all incoming transitions to $q$
are from $Merged$. Then this state does not influence the resulting DFCA (it
becomes unreachable after merging the other states in $Merged$), and therefore
its similarities $\#\{s \in non\text{-}Merged \mid q \in Sim_s\}$ must not be
considered in formula (1) for counting the number of DFCA produced by merging
similar states.

□ 

The merged states have incoming transitions from other skeleton states so 
that distinct DFCA can be produced. 

One has to notice from the merging theorem (Theorem 1) that the transitions
are redirected after each merging. This is actually what makes the difference
between the minimal DFCA obtained from the DFA: all the minimal DFCA have the same
states, the same transitions that appear in the minimal DFA between these states
(i.e. the skeleton) plus some “loop backs” produced by merging states. These
loop backs are the only difference between the minimal DFCA obtained in this
way. So, if we make sure that each merging state has at least one incoming
transition from skeleton states, then we add at least a new transition to the
skeleton in the minimal DFCA.

Lemma 6. In the worst case, the sink state has level less than $l + 1$, or it
will not help in producing any more different minimal DFCA.

Proof. Assume that the sink state, $sink$, has level $l + 1$; then the last final
state, $last$, has level $l$. If there are at least two final states, then
$last \in Merged$, $sink \in Merged$, and $\delta(s,a) \ne sink$, for all
$s \in Merged$ and $a \in \Sigma$, so we apply Lemma 5.

If $last$ is the only final state, then any two states
$p, q \in Q - \{sink, last\}$ are not similar. Indeed, if they have the same level
we apply Lemma 3; if $level(p) < level(q)$, then $\delta(q,w) = last$ for some
$w \in \Sigma^*$. Of course, we also have $|x_q| + |w| = l$ (it cannot be less,
since the level of $last$ is $l$, and it cannot be more than $l$, because this is
the length of the longest words in $L$). From $level(p) < level(q)$ we deduce
that $|x_p| + |w| < |x_q| + |w| = l$, so $\delta(p,w) \ne last$ and
$|w| \le l - \max(level(p), level(q))$, therefore $p \not\sim_A q$. The number of
minimal DFCA is in this case only $n - 1$, where $n$ is the number of states in
the minimal DFA. □

In the following we will find an expression for the maximum of formula (1) and
an example of a minimal DFA that reaches the computed bound.

Remark 1. To produce several distinct minimal DFCA we need to have “choices”
in merging states. The only possibility when we can have such “choices”
(according to Lemma 1) is the following: a high level state is similar with
several lower level states and the lower level states are not similar with each
other; in this case we have more than one possibility of merging, leading to
distinct non-isomorphic DFCA. From Theorem 1 we know that we always merge higher
level states into lower level states.



Theorem 3. a) The maximal number of distinct, non-isomorphic minimal
deterministic finite cover automata obtained from merging states from a minimal
DFA with $n$ states is $\frac{k_0!}{(2k_0-n+1)!}$, where
$k_0 = \left\lceil \frac{4n-9+\sqrt{8n+1}}{8} \right\rceil$.

b) For any $n > 1$ there is a minimal DFA with $n$ states for which the number
of minimal DFCA obtained from merging states is exactly the number given in part
a) (i.e. the upper bound is reached).

Proof. We first prove the following statements that will help us in counting the 
maximal number of distinct DFCA obtained by merging similar states. All of 
the following statements are proved in the worst case: 

1. The skeleton states are present on all levels up to $k_0 - 1$, where $k_0$ is
the highest level a state can have in the minimal DFA.

2. The states that are merged have the highest levels possible. 

3. There is at most one state that will be merged on any level in the DFA. 
4. We can have at most one skeleton state for each level.

5. The number of DFCA obtained by merging states from the minimal DFA
which satisfies the aforementioned properties and has the highest level for
its states $k_0$ is $\frac{k_0!}{(2k_0-n+1)!}$.

(1) From Lemma 4 and Lemma 5 we get that for each state (but the start state)
there is an incoming transition in our original DFA from a skeleton state, so if
the highest level a state can have in the DFA is $k_0$, then we have at least a
skeleton state on each level from 0 to $k_0 - 1$ (otherwise, the maximum level in
the DFA is less than $k_0$).

(2) Because in the worst case the number of similarities has to be as large
as possible, for each state $q$ we must have as many states $s$ as possible such
that $q \in Sim_s$. Therefore, the levels of the merged states have to be higher
than the levels of as many skeleton states as possible.

(3) Assume that there are at least two states $p, q \in Merged$ on some level
$r$. They cannot be similar, because of Lemma 3, and cannot both be similar with
another state of a lower level, because of Lemma 1 (if they are both similar
with a lower level state then they will also be similar with each other). So, the
set of states of our automaton is partitioned into at least two disjoint parts,
all the states that are similar with $p$ are in one of these partitions (and
maybe we have also a set of states non similar with any of these states of level
$r$). Because $\{s \mid p \in Sim_s\} \cap \{s \mid q \in Sim_s\} = \emptyset$,
the formula for the number of minimal DFCAs has lower factors, since
$(2r)! > (r!)^2$. It is clear that the worst case is when we have a single
partition (according to the states on level $r$), meaning that in the worst case
we have at most one merged state on each level (one can also prove this by
induction on the maximal level of the automaton).

(4) If we have two or more skeleton states on the same level we are not in
the worst case, since one can get more (possible) similarities just by taking one
of the skeleton states from the same level and making it a new start state (level
0), the old start state being “pushed” to level one. Then the level of every
state is increased by one, thus making it possible that more merged states are
similar with this new level 0 state (all the merged states that had lower level
than this state can now be similar with it), i.e. the sets
$(\{s \mid p \in Sim_s\})_{p \in Merged}$ may have one element more.

(5) According to the previous remarks, we have at least one skeleton state
on each level up to level $k_0 - 1$, and we have at most one skeleton state on
each level, so we must have exactly one skeleton state on each level, from 0 to
$k_0 - 1$.

We also know that we cannot have more than one merged state on a level 
and these states must have their levels as large as possible. 

Let us count how many distinct DFCA we produce in this worst case. First, let
us see how many states are merged: we have $k_0$ skeleton states on the levels
from 0 to $k_0 - 1$ and possibly the sink state, so the remaining $n - k_0 - 1$
states are merged states. These states must have the highest levels, so they are
at levels $k_0, k_0 - 1, \ldots$ down to some level $k_1$, one for each level,
therefore $k_1 = k_0 - (n - k_0 - 1) + 1 = 2k_0 - n + 2$.

The maximal number of skeleton states that can be similar with a state
$p \in Merged$ is the number of states that have their levels less than that of
$p$, which is exactly $level(p)$ since the skeleton states are one per level.

Then, the maximal number of DFCA obtained in this way is:

$$(2k_0 - n + 2)(2k_0 - n + 3) \cdots (k_0 - 1)\,k_0 = \frac{k_0!}{(2k_0 - n + 1)!} \quad \text{for } \frac{n-1}{2} \le k_0 < n.$$



To finish proving the first statement of the theorem, we have to find for each
given $n$ the $k_0 = l$ such that the formula $\frac{k_0!}{(2k_0-n+1)!}$ for
$\frac{n-1}{2} \le k_0 < n$ reaches its maximum. After an easy analysis, we
obtain that this is equivalent to finding the smallest $k_0$ for which the
inequality

$$\frac{k_0!}{(2k_0 - n + 1)!} \ge \frac{(k_0 + 1)!}{(2k_0 - n + 3)!}$$

holds.

We distinguish two cases: 1) $n = 2m$ ($n$ is even) and 2) $n = 2m + 1$ ($n$ is
odd).

Case 1) $n = 2m$:

we rewrite the formula in the form
$\frac{(m+k)!}{(2k+1)!} \ge \frac{(m+k+1)!}{(2k+3)!}$ ($k = k_0 - m$),
which implies that $8k^2 + 18k + 10 - n \ge 0$, and the solutions are all integers
$k \ge \frac{-9+\sqrt{8n+1}}{8}$ (we consider only the positive solutions, of
course).

Because we are looking for the smallest $k_0$ with this property, we get
$k_0 = \left\lceil \frac{4n-9+\sqrt{8n+1}}{8} \right\rceil$.



Case 2) $n = 2m + 1$:

we rewrite the formula in the form
$\frac{(m+k+1)!}{(2k+2)!} \ge \frac{(m+k+2)!}{(2k+4)!}$ ($k = k_0 - m - 1$),
which implies that $8k^2 + 26k + 21 - n \ge 0$, and the solutions are all integers
$k \ge \frac{-13+\sqrt{8n+1}}{8}$.

So, in this case, $k_0$ becomes:
$k_0 = m + k + 1 = \left\lceil \frac{4n-9+\sqrt{8n+1}}{8} \right\rceil$.
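The closed form for $k_0$ can be cross-checked numerically against a direct maximization of $k_0!/(2k_0-n+1)!$. The following is a small sketch of such a check; the function names are illustrative, and integer arithmetic is used for the square root to avoid floating-point rounding at the ceiling.

```python
from math import factorial, isqrt

def count(n, k0):
    """k0! / (2*k0 - n + 1)!, the number of DFCA when the highest level is k0."""
    return factorial(k0) // factorial(2 * k0 - n + 1)

def closed_form_k0(n):
    """ceil((4n - 9 + sqrt(8n + 1)) / 8), computed with integer arithmetic."""
    root = isqrt(8 * n + 1)
    if root * root != 8 * n + 1:
        root += 1                      # round the irrational square root up
    return -((-(4 * n - 9 + root)) // 8)

def brute_force_max(n):
    """Maximum of count(n, k0) over all integers k0 with (n - 1) / 2 <= k0 < n."""
    return max(count(n, k0) for k0 in range(n // 2, n))
```

For $n = 9$ this gives $k_0 = 5$ and $5!/2! = 60$, the values quoted in the caption of Fig. 2.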

Let us prove now the b) part of the theorem by giving a construction for the 
worst case minimal DFA: 

Consider that the DFA has $n$ states and that the maximum is reached in the
formula $\frac{k_0!}{(2k_0-n+1)!}$ for some $k_0$ computed using the formula from
part a); then we know that all the minimal DFCA have $k_0 + 1$ states.

We construct the minimal DFA $A = (Q, \Sigma, \delta, s, F)$ as follows:
$Q = \{0, 1, \ldots, n-1\}$, $\Sigma = \{a, b\}$, $s = 0$, $F = Q - \{k_0\}$,

$\delta(p, a) = p + 1$, for all $p = 0, \ldots, n-2$, $p \ne k_0 - 1, n - 1$,

$\delta(p, b) = p + 1$, for all $p = 0, \ldots, 2k_0 - n$, and $p = k_0, \ldots, n-2$,

$\delta(p, b) = p + n - k_0$, for all $p = 2k_0 - n, \ldots, k_0 -$

$\delta(k_0 - 1, a) = n - 1$, $\delta(n - 1, a) = n - 1$, $\delta(n - 1, b) = n - 1$.

This automaton has the first $k_0 + 1$ states non similar (the skeleton states)
and the next states are similar with all the states that have the level lower than
them (but the sink state); one can easily see that the given DFA is minimal and
satisfies all the properties given before for the worst case, so the actual number
of minimal DFCA that are obtained by merging similar states from the given
minimal DFA is $\frac{k_0!}{(2k_0-n+1)!}$. □

Fig. 2. The minimal DFA with maximum number of DFCA for $n = 9$,
$k_0 = \left\lceil \frac{27+\sqrt{73}}{8} \right\rceil = 5$, so the number of
distinct (non-isomorphic) DFCA is $\frac{5!}{2!} = 60$.



4 Final Remarks 

One may notice that there are minimal cover automata for a finite language
$L$ that cannot be obtained by merging all the similar states from the minimal
DFA accepting $L$; such an example is depicted in Figure 1. A subsequent paper
will deal with counting all the minimal non-isomorphic DFCA for a given finite
language $L$; we expect the result of the current paper to be useful in the
general case as well.
Abstract. Regex are used in many programs such as Perl, Awk,
Python, egrep, vi, emacs etc. It is known that regex are different from 
regular expressions. In this paper, we give regex a formal treatment. 
We make a distinction between regex and extended regex; while regex 
present regular languages, extended regex present a family of languages 
larger than regular languages. We prove a pumping lemma for the 
languages expressed by extended regex. We show that the languages 
represented by extended regex are incomparable with context-free 
languages and a proper subset of context-sensitive languages. 

Keywords: Regular expressions, regex, extended regex, formal lan- 
guages, programming languages 



1 Introduction 

It is known that the practical “regular expressions” are different from the
theoretical ones. The practical “regular expressions” [2] are often called regex.
Regex are used in many environments, like Perl, Python, Awk, egrep, lex, vi,
emacs, etc. Regex were developed under the influence of theoretical regular
expressions. For example, the regex in Lex [3] are similar to theoretical regular
expressions. However, regex are quite different in many other environments. Many
regex in use now can express a larger family of languages than the regular
languages. For example, Perl regex PQ can express
$L_1 = \{a^n b a^n \mid n \ge 0\}$ and $L_2 = \{ww \mid w \in \{a, b\}^*\}$.
However, Perl regex cannot express the language
$L_3 = \{a^n b^n \mid n \ge 0\}$. It is relatively easy to show that a language
can be expressed by a regex. For example, $L_1$ can be expressed by a Perl regex
“(a*)b\1” and $L_2$ by “((a|b)*)\1”. However, it is usually difficult to show
that a language cannot be expressed by certain devices. For example, how do we
prove that $L_3$ cannot be expressed by a Perl regex? It is clear that there is a
need for some formal treatment of regex. This is the main purpose of the paper.
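The two positive claims above are easy to try out. The sketch below uses Python's `re` module, assuming that its back reference syntax agrees with Perl's on these two patterns; the helper name `in_language` is illustrative.

```python
import re

# (a*)b\1 describes L1 = { a^n b a^n : n >= 0 }.
L1 = re.compile(r"(a*)b\1")
# ((a|b)*)\1 describes L2 = { ww : w in {a, b}* }.
L2 = re.compile(r"((a|b)*)\1")

def in_language(pattern, word):
    """True when the whole word (not just a substring) matches the pattern."""
    return pattern.fullmatch(word) is not None
```

For instance, `in_language(L1, "aabaa")` and `in_language(L2, "abab")` hold, while `in_language(L1, "aaba")` and `in_language(L2, "aba")` do not.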

Regex are defined differently in different environments. However, in some 
sense, there is a common subset of the definitions. In this paper, we first give a 
formal definition of regex and extended regex according to the common subset of 
the definitions given in different environments. Then we prove a pumping lemma 
for extended regex languages. We also show that the family of extended regex 
languages is a proper subset of the family of context-sensitive languages, and it 
is incomparable with the family of context-free languages. 

2 Basic Definition of Regex and Extended Regex 

Let Σ be an ordered set of all printable letters except that each of the following
letters is written with an escape character \ in front of it: (, ), {, }, [, ],
$, |, \, ., ?, *, and +. In addition, Σ also includes \n and \t as the new line
and tab character, respectively. In the following, we give the definition of regex
as well as the language it defines. For an expression e, we use L(e) to denote the
set of all words that match e, i.e., the language e defines.

Basic form of regex: 

(1) For each a ∈ Σ, a is a regex and L(a) = {a}. Note that for each
x ∈ {(, ), {, }, [, ], $, |, \, ., ?, *, +}, \x ∈ Σ and is a regex and
L(\x) = {x}. In addition, both \n and \t are in Σ and are regex, and L(\n) and
L(\t) denote the languages consisting of the new line and the tab, respectively.

(2) For regex $e_1$ and $e_2$,

$(e_1)(e_2)$ (concatenation),

$(e_1)|(e_2)$ (alternation), and
$(e_1)^*$ (Kleene star)

are all regex, where $L((e_1)(e_2)) = L(e_1)L(e_2)$,
$L((e_1)|(e_2)) = L(e_1) \cup L(e_2)$, and $L((e_1)^*) = (L(e_1))^*$. Parentheses
can be omitted. When they are omitted, alternation, concatenation, and Kleene star
have the increasing priority.

(3) A regex is formed by using (1) and (2) a finite number of times.

Shorthand form:

(1) For each regex e, (e)+ is a regex and (e)+ = e(e)*.

(2) The character ‘.’ means any character except ‘\n’.

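Character classes:

(1) For $a_{i_1}, a_{i_2}, \ldots, a_{i_t} \in \Sigma$, $t \ge 1$,
$[a_{i_1} a_{i_2} \ldots a_{i_t}]$ is a regex and
$[a_{i_1} a_{i_2} \ldots a_{i_t}] = a_{i_1} | a_{i_2} | \ldots | a_{i_t}$.

(2) For $a_i, a_j \in \Sigma$ such that $a_i \le a_j$, $[a_i\text{-}a_j]$ is a
regex and $[a_i\text{-}a_j] = a_i | a_{i+1} | \ldots | a_j$.

(3) For $a_{i_1}, a_{i_2}, \ldots, a_{i_t} \in \Sigma$, $t \ge 1$,
$[\text{\textasciicircum}a_{i_1} a_{i_2} \ldots a_{i_t}]$ is a regex and
$[\text{\textasciicircum}a_{i_1} a_{i_2} \ldots a_{i_t}] = b_{i_1} | b_{i_2} | \ldots | b_{i_s}$,
where $\{b_{i_1}, b_{i_2}, \ldots, b_{i_s}\} = \Sigma - \{a_{i_1}, a_{i_2}, \ldots, a_{i_t}\}$.

(4) For $a_i, a_j \in \Sigma$ such that $a_i \le a_j$,
$[\text{\textasciicircum}a_i\text{-}a_j]$ is a regex and
$[\text{\textasciicircum}a_i\text{-}a_j] = b_{i_1} | b_{i_2} | \ldots | b_{i_s}$,
where $\{b_{i_1}, b_{i_2}, \ldots, b_{i_s}\} = \Sigma - \{a_i, a_{i+1}, \ldots, a_j\}$.

(5) Mixture of (1) and (2), or (3) and (4), respectively.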

Anchoring: 

(1) Start-of-line anchor ^.

(2) End-of-line anchor $.

We call an expression that satisfies the above definitions a regex. Note that 
the empty string is not a regex. 

Example 1. The expression “^.*$” is a regex, which matches any line including
the empty line.

Example 2. The expression “[A-Z][A-Za-z0-9]*” matches any word starting with
a capital letter and followed by any number of letters and digits.
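The character class shorthands can be compared with their expansions directly. The following sketch uses Python's `re` module as a stand-in for the regex dialects discussed here; the pattern names are illustrative, and only a small class is expanded by hand.

```python
import re
import string

CLASS = re.compile(r"[a-c]")            # character class (2)
ALTERNATION = re.compile(r"a|b|c")      # its expansion as an alternation
NEGATED = re.compile(r"[^a-c]")         # complemented class (4)
WORD = re.compile(r"[A-Z][A-Za-z0-9]*") # the expression of Example 2

# The class and the alternation accept exactly the same single characters.
SAME = all(
    (CLASS.fullmatch(ch) is None) == (ALTERNATION.fullmatch(ch) is None)
    for ch in string.printable
)
```

Here `WORD` accepts "Regex2002" but not "regex", and `NEGATED` accepts "d" but not "a".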

It is clear that regex can express only regular languages. However, the fol- 
lowing construct, which appears in Perl, Emacs, etc., is not a regular construct. 

Back reference:

‘\m’, where m is a number, matches the content of the mth pair of
parentheses before it.

Note that the pairs of parentheses in a regex are ordered according to the 
occurrence sequence of their left parentheses. Note also that if the mth pair of 
parentheses are in a Kleene star and ‘\m’ is not in the same Kleene star, then 
‘\m’ matches the content of the mth pair of parentheses in the last iteration of 
the Kleene star. A more formal definition is given later. We first look at several 
introductory examples. 

Example 3. The expression “(a*)b\1” defines the language

$\{a^n b a^n \mid n \ge 0\}$.

Example 4. The expression “(The)( file ([0-9][0-9])).*\3” matches any string
that contains “The file ” followed by a two-digit number, then any string, and
then the same number again.

Example 5. For the expression e = “(a*b)*\1”, aabaaabaaab ∈ L(e) and
aabaaabaab ∉ L(e).
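Examples 4 and 5 can be reproduced with Python's `re` module, which, like Perl, lets a back reference to a group inside a star refer to the last iteration of that group. This is only a sketch under that assumption; note that Python differs from the convention adopted later in this paper in one respect: a back reference to a group that did not participate in the match fails instead of matching the empty word.

```python
import re

EXAMPLE_4 = re.compile(r"(The)( file ([0-9][0-9])).*\3")
EXAMPLE_5 = re.compile(r"(a*b)*\1")

# Example 4: the two-digit number captured by the third pair must occur again.
HIT = EXAMPLE_4.search("The file 42 was copied to backup 42")
MISS = EXAMPLE_4.search("The file 42 was copied to backup 17")

# Example 5: \1 repeats the last iteration of (a*b), here "aaab".
MEMBER = EXAMPLE_5.fullmatch("aabaaabaaab")
NON_MEMBER = EXAMPLE_5.fullmatch("aabaaabaab")
```

In the member word the star iterates over "aab" and "aaab", and the back reference then consumes the final "aaab".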

There appears to be no convenient way to recursively define the set of
extended regex $\alpha$ satisfying the condition that any back reference in
$\alpha$ occurs after the corresponding parenthesis pair. Thus below we first
define an auxiliary notion of semi-regex and the extended regex are then defined
as a restriction of semi-regex.







C. Campeanu, K. Salomaa, and S. Yu 



Definition 1. A semi-regex is a regex over the infinite alphabet

$\Sigma \cup \{\backslash m \mid m \in N\}$

where $N$ is the set of natural numbers. Let $\alpha$ be a semi-regex. The
matching parenthesis pairs of $\alpha$ are numbered from left to right according
to the occurrence of the left parenthesis (the opening parenthesis) of each pair.

A semi-regex $\alpha$ is an extended regex if the following condition holds. Any
occurrence of a back reference symbol \m ($m \in N$) in $\alpha$ is preceded by
the closing parenthesis of the mth parenthesis pair of $\alpha$.

Below we define the matches of an extended regex and the language defined
by an extended regex. Intuitively, a match of an extended regex $\alpha$ is just a
word denoted by the regex when each back reference symbol \m is replaced by the
contents of the subexpression $\beta_m$ corresponding to the mth pair of
parentheses in $\alpha$. Due to the star operation, a given subexpression
occurrence $\beta_m$ may naturally “contribute” many subwords to a match.
Following the convention used e.g. in Perl, \m will be replaced by the contents of
the last (rightmost) occurrence of $\beta_m$ appearing before this occurrence of
\m.

The condition that each back reference symbol occurs after the corresponding
parenthesis pair guarantees that a match as defined below cannot contain circular
dependencies.

We denote the set of occurrences of subexpressions of an extended regex $\alpha$
as SUB($\alpha$). Distinct occurrences of an identical subexpression are
considered to be different elements of SUB($\alpha$). A match of an extended regex
$\alpha$ is defined as a tree $T_\alpha$, following the structure of $\alpha$.
Naturally, due to the star operation, $T_\alpha$ may make multiple copies of parts
of the structure of $\alpha$.

Definition 2. A match of an extended regex $\alpha$ is a finite (directed,
ordered) tree $T_\alpha$. The nodes of $T_\alpha$ are labeled by elements of
$\Sigma^* \times$ SUB($\alpha$) and $T_\alpha$ is constructed according to the
following rules.

(i) The root of $T_\alpha$ is labeled by an element $(w, \alpha)$, $w \in \Sigma^*$.

(ii) Assume that a node $u$ of $T_\alpha$ is labeled by $(w, \beta)$ where
$\beta = (\beta_1)(\beta_2) \in$ SUB($\alpha$). Then $u$ has two successors that
are labeled, respectively, by $(w_i, \beta_i)$, $i = 1, 2$, where
$w_1, w_2 \in \Sigma^*$ have to satisfy $w = w_1 w_2$.

(iii) Assume that a node $u$ of $T_\alpha$ is labeled by $(w, \beta)$ where
$\beta = (\beta_1)|(\beta_2) \in$ SUB($\alpha$). Then $u$ has one successor that
is labeled by one of the elements $(w, \beta_i)$, $i \in \{1, 2\}$.

(iv) Assume that a node $u$ of $T_\alpha$ is labeled by $(w, \beta)$ where
$\beta = (\beta_1)^* \in$ SUB($\alpha$). Then $u$ has $k \ge 1$ successors labeled
by

$(w_1, \beta_1), \ldots, (w_k, \beta_1)$

where $w_1 \cdots w_k = w$, $w_i \ne \lambda$, $i = 1, \ldots, k$. If
$w = \lambda$, $u$ has a single successor $u_1$ labeled by $(\lambda, \beta)$ and
$u_1$ is a leaf of $T_\alpha$.

(v) If $(w, a)$, $a \in \Sigma$, occurs as a label of a node $u$ then necessarily
$u$ is a leaf and $w = a$.

(vi) If $(w, \backslash m)$ occurs as a label of a node $u$ then $u$ is a leaf and
$w \in \Sigma^*$ is determined as follows. Let $\beta_m$ be the subexpression of
$\alpha$ enclosed by the mth pair of parentheses and let $u_{\beta_m}$ be the
previous node of $T_\alpha$ in the standard left-to-right ordering that is labeled
by an element where the second component is $\beta_m$ that precedes $u$ in the
left-to-right ordering of the nodes. (Note that all nodes where the label contains
$\beta_m$ are necessarily independent and hence they are linearly ordered from
left to right.) Then we choose $w$ to be the first component of the label of
$u_{\beta_m}$. Finally, if $\beta_m$ does not occur in the label of any node of
$T_\alpha$ then we set $w = \lambda$.

The language denoted by an extended regex $\alpha$ is defined as

$L(\alpha) = \{w \in \Sigma^* \mid (w, \alpha) \text{ labels the root of some match } T_\alpha\}$.

Above the somewhat complicated condition (vi) means just that a back reference
symbol \m is substituted by the previous occurrence of the contents of the
subexpression $\beta_m$ corresponding to the mth parenthesis pair. It is possible
that $\beta_m$ does not appear in the label of any node of $T_\alpha$ if $\beta_m$
occurs inside a Kleene star and in (iv) we choose zero iterations of the star
operation. In this case the contents of \m will be $\lambda$.

If the nondeterministic top-down construction of $T_\alpha$ produces a leaf symbol
violating one of the conditions (v) or (vi), then the resulting tree is not a
well-formed match of $\alpha$. The fact that a back reference symbol \m has to
occur after the expression corresponding to the mth parenthesis pair guarantees
that in (vi) the label of $u_{\beta_m}$ can be determined before the label of $u$
is determined.

Example 6. Let $\alpha$ = “(((a|b)*)c\2)*”. Then $L(\alpha) = L_1^*$ where

$L_1 = \{wcw \mid w \in \{a, b\}^*\}$.

In the above examples, and in what follows, for simplicity we often omit 
several parenthesis pairs and only the remaining parentheses are counted when 
determining which pair corresponds to a given back reference. 

3 A Pumping Lemma for Extended Regex Languages 

There are several pumping lemmas for regular languages [3, 5], which are useful
tools for showing that certain languages are not regular languages. In the
following, we prove a pumping lemma for extended regex languages and give several
examples to show how to use the lemma to prove that certain languages are not
extended regex languages.

For a string (word) $x \in \Sigma^*$, denote by $|x|$ the length of $x$ in the
following.

Lemma 1. Let $\alpha$ be an extended regex. Then there is a constant $N$ such
that if $w \in L(\alpha)$ and $|w| > N$, there is a decomposition
$w = x_0 y x_1 y x_2 \ldots x_m$, for some $m \ge 1$, such that

1. $|x_0 y| \le N$,

2. $|y| \ge 1$,

3. $x_0 y^j x_1 y^j \ldots x_m \in L(\alpha)$ for all $j > 0$.

Proof. Let $N = |\alpha| 2^t$ where $t$ is the number of back-references in
$\alpha$. Let $w \in L(\alpha)$ and $|w| > N$. By the definition of $N$ it is
clear that some part of $w$ matches a Kleene star or plus in $\alpha$ that has
more than one iteration, because each back reference that does not appear inside a
Kleene star can at most double the length of the word it matches.

Let $w = x_0 y z$, where $y$ is the first nonempty substring of $w$ that matches a
Kleene star or plus. In order to satisfy $|x_0 y| \le N$, we let $y$ match the
first iteration of the star or plus. In other words, we have $e^*$ or $e^+$ in the
extended regex; the $y$ in $x_0 y$ matches $e$ and is the first iteration of the
star or plus. Then we have $|x_0 y| \le N$ and $|y| \ge 1$. There are the
following two cases: Case 1, $e^*$ (or $e^+$) is not back-referenced. Then it is
clear that the lemma holds for $m = 1$. Case 2, $e^*$ (or $e^+$) is
back-referenced. Let $z = x_1 y x_2 y x_3 \ldots x_m$ where the $y$'s are all the
back-references of $y$. Then we have $w = x_0 y x_1 y x_2 y x_3 \ldots x_m$, and
it is clear that $x_0 y^j x_1 y^j x_2 y^j \ldots x_m \in L(\alpha)$ for all
$j \ge 1$.

Note that in the case that $e$ in $(e)^*$ (or $(e)^+$), rather than $e^*$ (or
$e^+$), is back-referenced and $y$ matches $e$, the lemma clearly holds for
$m = 1$.



Example 7. Consider some special cases of regex for the pumping lemma. Let
$e_1$ = “(ab|ba)(\1)*”. Then the constant $N$ for the pumping lemma is
$|e_1| \times 2 = 24$. Since the Kleene star is not referenced, any word $w$ that
matches $e_1$ with $|w| > N$ can be decomposed into $xyz$ such that
$|xy| \le N$, $|y| \ge 1$, and $x y^j z \in L(e_1)$ for all $j > 0$. For example,
let $w = (ba)^k$ with $2k > N$. Then $x = ba$, $y = ba$, and $z = (ba)^{k-2}$.

Let $e_2$ = “bab((a|b)c\2)*”. Then $N = 28$. Again the Kleene star is not
referenced. So, any word $w \in L(e_2)$ with $|w| > N$ can be decomposed into
$xyz$ such that $|xy| \le N$, $|y| \ge 1$, and $x y^j z \in L(e_2)$ for any
$j > 0$. For example, let w = babacabcbbcbacabcbacaacaacaaca. Then x = bab,
y = aca, and z = bcbbcbacabcbacaacaacaaca.

Let $e_3$ = “(a*)b\1bb\1bbb”. Then $N = 14 \times 2^2 = 56$. Any word
$w \in L(e_3)$ with $|w| > N$ can be decomposed into $x_0 y x_1 y x_2 y x_3$ such
that $|x_0 y| \le N$, $|y| \ge 1$, and
$x_0 y^j x_1 y^j x_2 y^j x_3 \in L(e_3)$ for all $j > 0$.
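The third case can be illustrated concretely: when the first iteration of the star is pumped, every back-referenced copy has to be pumped with it. The sketch below uses Python's `re` module as the matcher; the word length, the choice $y = a$, and the helper `pump` are illustrative assumptions rather than part of the lemma.

```python
import re

E3 = re.compile(r"(a*)b\1bb\1bbb")

def pump(parts, y, j):
    """Build x0 y^j x1 y^j ... xm from the list parts = [x0, x1, ..., xm]."""
    return (y * j).join(parts)

n = 17                                   # 3 * 17 + 6 = 57 > N = 56
w = "a" * n + "b" + "a" * n + "bb" + "a" * n + "bbb"
y = "a"                                  # the first iteration matched by a*
parts = ["", "a" * (n - 1) + "b", "a" * (n - 1) + "bb", "a" * (n - 1) + "bbb"]
```

Pumping all three copies of y together keeps the word in the language, whereas repeating only the first copy (for example `"a" + w`) produces a word that no longer matches.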

Next we show how the pumping lemma is used. 

Example 8. The language $L = \{a^n b^n \mid n \ge 0\}$ cannot be expressed by an
extended regex.

Proof. Assume that $L$ is expressed by an extended regex. Let $N$ be the constant
of Lemma 1. Consider the word $a^N b^N$. By Lemma 1, $a^N b^N$ has a decomposition
$x_0 y x_1 y x_2 \ldots x_m$, $m \ge 1$, such that (1) $|x_0 y| \le N$, (2)
$|y| \ge 1$, and (3) $x_0 y^j x_1 y^j \ldots x_m \in L$ for all $j > 0$. According
to (1) and (2), $y = a^i$ for some $i > 0$. But then the word
$x_0 y^2 x_1 y^2 \ldots x_m$ is clearly not in $L$. It is a contradiction.
Therefore, $L$ does not satisfy Lemma 1 and, thus, $L$ cannot be expressed by any
extended regex.
Similarly, we can prove that the language $\{a^n b^n c^n \mid n \ge 1\}$ is not
an extended regex language. As an application of the pumping lemma we get also the
following.

Lemma 2. The family of extended regex languages is not closed under
complementation.

Proof. The language

$L_1 = \{a^m \mid m > 1 \text{ is not prime}\}$

is expressed by the extended regex “(aaa*)\1(\1)*”. Assume that the complement
of $L_1$, $L_1^c$, is an extended regex language and apply Lemma 1 to $L_1^c$.
This implies that there exist $n_1 \ge 0$, $n_2 \ge 1$ such that for all $j > 0$,

$a^{n_1 + j \cdot n_2} \in L_1^c$.

This is a contradiction since it is not possible that $n_1 + j \cdot n_2$ is prime
for all $j > 0$.
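The regex used in this proof can be tested against ordinary divisibility. The following sketch relies on Python's `re` module and checks only a finite range of lengths; the function names are illustrative.

```python
import re

# (aaa*)\1(\1)* : a block of at least two a's, repeated at least twice.
COMPOSITE = re.compile(r"(aaa*)\1(\1)*")

def is_composite(m):
    """True when a^m matches the extended regex as a whole."""
    return COMPOSITE.fullmatch("a" * m) is not None

def has_proper_divisor(m):
    """Reference check: m has a divisor d with 1 < d < m."""
    return any(m % d == 0 for d in range(2, m))
```

The two predicates agree on every length tried here, which matches the claim that the regex describes exactly the words $a^m$ with $m > 1$ not prime.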

We can note also that the language $L_1$ in the proof of Lemma 2 is an extended
regex language over a one-letter alphabet that is not context-free. In particular,
this means that there are extended regex languages that do not belong to the
Boolean closure of context-free languages.

4 Other Properties of Regex and Extended Regex
Languages

Here we prove that every extended regex language is a context-sensitive language. 
We also show the relationship between extended regex languages and context- 
free languages. 

Theorem 1. Extended regex languages are context-sensitive languages.

Proof. It suffices to show that each extended regex language is accepted by a
linear-bounded automaton (LBA), i.e., a nondeterministic Turing machine in
linear space. For a given extended regex, if there is no back reference, we can
simply construct a finite automaton that will accept all words that match the
regex.

For handling back references, we need to store each string that matches a
subexpression that is surrounded by a pair of parentheses for later use. If there
are m pairs of parentheses in the extended regex, we need m buffers. We store the
part of the input string that matches the content of the ith pair of parentheses
in the ith buffer.

For a given extended regex, we first label all the parentheses with their
appearance number. For example, the extended regex “(a*(ba*)\2)b” would become
“(₁a*(₂ba*)₂\2)₁b”. Note that for a fixed extended regex the number of
parenthesis pairs is a constant that does not depend on the input word, and
hence the numbering can be done easily in linear space. Then we construct a
nondeterministic finite automaton (NFA) which treats each labeled parenthesis
and each back reference as an input symbol.
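The labeling step can be sketched with a stack of open parentheses. The function below is an illustration only: it assumes balanced parentheses, writes the pair number as a plain digit after each parenthesis instead of a subscript, and copies every backslash sequence (escaped symbol or back reference) unchanged.

```python
def label_parentheses(regex):
    """Attach to each parenthesis the number of its pair (ordered by left parenthesis)."""
    out, stack, counter, i = [], [], 0, 0
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":                    # escaped symbol or back reference: copy as is
            out.append(regex[i:i + 2])
            i += 2
            continue
        if ch == "(":
            counter += 1                  # pairs are numbered by their left parenthesis
            stack.append(counter)
            out.append(f"({counter}")
        elif ch == ")":
            out.append(f"){stack.pop()}") # closes the most recently opened pair
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

Applied to the example from the text, `label_parentheses(r"(a*(ba*)\2)b")` yields `(1a*(2ba*)2\2)1b`.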

Then we build a Turing machine which works according to the NFA as follows.
It reads the input symbols and follows the transitions of the NFA. For a “(ᵢ”
transition, it does not read any input symbol but starts to store the next
matching input symbols into the ith buffer. For a “)ᵢ” transition, it also does
not read any input symbol but stops storing the matching input symbols into the
ith buffer. Whenever there is a back reference in the extended regex, say “\i”, it
just compares the string with the content of the ith buffer. It accepts the input
string if there is a way to reach a final state when it has just finished reading
the input. Note that if the ith pair of parentheses is in a Kleene star, it may
store input symbols in the ith buffer several times. Then “\i” always matches the
latest word stored in the ith buffer.

Each buffer needs at most the space of the input word, and there are a 
constant number of buffers. Thus, an extended regex language is accepted by an 
LBA and, thus, is a context-sensitive language. 



Theorem 2. The family of extended regex languages is incomparable with the 
family of context-free languages. 

Proof. The language $L = \{a^n b a^n b a^n \mid n \ge 1\}$ is clearly an extended
regex language. It can be expressed as “(a+)b\1b\1”. However, L is not a
context-free language. We know that $\{a^n b^n \mid n \ge 0\}$ is a context-free
language, but it is not an extended regex language as we have proved in Section 3.
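
As an illustration only, the expression for L can be tried in an engine that supports back references. The sketch below assumes Python's `re` module, which the paper does not mention; the helper name `in_L` is hypothetical.

```python
import re

# (a+)b\1b\1 : group 1 captures a^n, and each \1 must repeat exactly that string.
PATTERN = re.compile(r"(a+)b\1b\1")

def in_L(word: str) -> bool:
    """Return True iff word has the form a^n b a^n b a^n with n >= 1."""
    return PATTERN.fullmatch(word) is not None

print(in_L("aabaabaa"))  # True
print(in_L("aabaaba"))   # False
```

The back reference plays the role of the buffer in the proof of Theorem 1: the captured group is stored once and compared against later parts of the input.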



Theorem 3. The family of extended regex languages is a proper subset of the 
family of context-sensitive languages. 
Abstract. One of the new approaches to data classification uses finite
state automata for representation of prefix codes. An important task 
driven by the need for the efficient storage of such automata in memory 
is the decomposition of prefix codes into prime factors. We investigate 
properties of such prefix code decompositions. A linear time algorithm 
is designed which finds the prime decomposition $F_1 F_2 \cdots F_k$ of a regular
prefix code F given by its minimal deterministic automaton. 



1 Introduction 

A simple basic operation of formal languages is their concatenation. However,
the complexity of the inverse operation of decomposing a language $L$ into a
nontrivial concatenation $L_1 L_2$ is not well understood. A concatenation is
trivial if one of $L_1$, $L_2$ consists exactly of the empty string. A language
is prime if it cannot be expressed as a non-trivial concatenation of two
languages. The prime factorization is the decomposition into prime factors. It
has been proved in P that the problem of primality is undecidable for
context-free languages. Although for regular languages this problem is
decidable [T], the decomposition of a regular language into primes may not be
unique. Algorithmic issues of this problem are also of interest: it has been
left open in [2] whether primality of a language $L$ is an NP-complete problem,
even in the case of a finite language $L$.

In this paper we consider decomposition problems for the class of regular
prefix codes. Regular prefix codes are languages recognized by finite automata
and such that no word is a prefix of another. The representation of the code is
a deterministic finite automaton (dfa, in short). From a practical point of view
precisely this type of language forms the basis of some classification
algorithms in hardware-based technology. Our main results show that
decomposition problems for regular prefix codes are much easier than for general
regular languages. Our algorithms have linear time complexity. The factorization
of a regular prefix code into prime prefix codes is unique, while the
factorization of an arbitrary regular language need not be unique. Our results
are especially interesting for infinite regular prefix codes.

The practical applications of prefix code decomposition show that the classi- 
cal theory of finite automata and formal languages can play an important role in 
modern technology of data classification. Data classification is one of the central
problems in network information processing. Packets arriving from a communi- 
cation channel have to be accepted or rejected, or more generally, divided into 
several classes, based on their content. There are three fundamental approaches 
to data classification, used in modern technology: network processor approach, 
used by such companies as Intel, Motorola, and Vitesse, content addressable 
memory (CAM) approach [2], used, e.g., by Netlogic, IDT, and Kawasaki, and
dedicated chip approach based on finite state automata, used, e.g., by Agere, 
Raqia, and Solidum Systems Corp. [3]. The first approach is software based, 
while the last two are implemented in hardware. Among the three, the approach 
using a dedicated chip is by far the fastest: it classifies data packets
at wire speed, i.e., the classification is complete as soon as the last
bit of the packet is read. In particular, the mechanism of data classification
used by Solidum Systems Corp., see http://www.solidum.com, is built around 
programmable classification processors that can be configured to closely inspect 
packets for vital information. Based on a programmable state machine technol- 
ogy and on an openly distributed pattern description language PAX PDL [4151 
E], Solidum’s scalable and forward-compatible classification processors simulta- 
neously parse, identify, and tag packets. The information collected can then be 
used to make intelligent routing and switching decisions. Prefix codes are of par- 
ticular importance to data classification, due to the fact that in such languages 
a classification decision concerning an input word can be made on-line without 
the risk of having to change the decision later, when more bits arrive. Language 
decomposition and computation of prime components are the most important 
theoretical tools used in this technology. Each classification task requires a spe- 
cific finite state automaton to carry it out. However, due to the large size of 
these automata and to their large number, it is impossible to store all these ob- 
jects simultaneously. Fortunately, prefix codes corresponding to these automata 
can be decomposed into primes, i.e., undecomposable prefix codes. It turns out 
that, in practice, those prime factors are often the same for many prefix codes. 
Consequently, it is enough to store a relatively small number of simple automata 
which are building blocks for all others. Whence the necessity of efficient compu- 
tation of prime components of a given prefix code. Another advantage of having 
a decomposition of prefix codes into primes is that it facilitates updates. Very
often a classification task, arising e.g. from a routing problem, changes over
time [7], which requires an update of a part of the corresponding automaton. In such
situations we often need to replace only some of the prime factors and are able 
to reuse the remaining ones. 

Prefix codes over a finite alphabet constitute a free monoid, see, e.g., [H].
Thus, every prefix code admits a unique decomposition into primes. In this paper
we present an $O(n)$-time algorithm for finding prime components of a regular
prefix code $F$ given by the minimal dfa (min-dfa) $A_F$ of size $n$. (We assume
that the size of the alphabet is bounded, hence the number of transitions is
linear in the number of states.) The algorithm finds a decomposition

$$F = F_1 F_2 \cdots F_k$$

into a concatenation of prime prefix codes in linear time (and thus it also
decides if the given prefix code is prime). When $F$ is a finite prefix code,
this decomposition is relatively easy to find and one can prove that the sum of
sizes of the min-dfas of all the prime components is $O(n)$. However, in the
general case, when $A_F$ contains loops, the sum of sizes of the prime
components may sometimes be $\Omega(n^2)$. Despite that, our decomposition
algorithm runs in time $O(n)$ and the computed succinct representation of the
decomposition is of $O(n)$ size.

2 Definitions and Basic Facts 

We consider the alphabet $\Sigma = \{1, 2, \ldots, m\}$. The language containing
all words will be denoted by $\Sigma^*$ and the empty word by $\varepsilon$. We
say that $v$ is a prefix (left factor) of $w$ if there exists $u$ such that
$vu = w$. Let $E$, $F$ be languages over $\Sigma$. We define:
$EF^{-1} = \{v \in \Sigma^* \mid \exists w \in F,\ vw \in E\}$ and
$F^{-1}E = \{v \in \Sigma^* \mid \exists w \in F,\ wv \in E\}$.
$F^{-1}E$ and $EF^{-1}$ are the left and right quotient of language $E$ by
language $F$, respectively. If $\{w\}$ is a singleton set, then we will write
$wE$, $w^{-1}E$, and $Ew^{-1}$ instead of $\{w\}E$, $\{w\}^{-1}E$, and
$E\{w\}^{-1}$, respectively.

Definition 1. A language $F \subseteq \Sigma^*$ is called a prefix code iff $F$
does not contain two different words such that one is a prefix of the other.
Prefix codes $\{\varepsilon\}$ and $\emptyset$ are called trivial.

In this paper we consider only regular prefix codes, i.e., the ones which are 
recognized by finite automata. 

Definition 2. A deterministic finite automaton (dfa) $A = (Q, \delta, s, F)$
consists of a finite set $Q$ of states, a partial transition function
$\delta : Q \times \Sigma \to Q$, an initial state $s \in Q$, and a set of final
states $F \subseteq Q$.

Lemma 1 (fS|). For any language $E$, the following properties are equivalent:
(1) $E$ is a prefix code. (2) $E$ is empty or the minimal dfa (min-dfa) accepting
$E$ has a single terminal state without outgoing transitions.

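Let $E_1, E_2, \ldots, E_m$ be languages over $\Sigma = \{1, 2, \ldots, m\}$. The
switch composition, denoted $[E_1 : E_2 : \ldots : E_m]$, is defined by:

$$[E_1 : E_2 : \ldots : E_m] = \bigcup_{i \in \Sigma} i E_i \qquad (1)$$

E.g., if $\Sigma = \{1, 2, 3\}$, we have
$[\emptyset : \{\varepsilon\} : \{1\}] = 1\emptyset \cup 2\{\varepsilon\} \cup 3\{1\} = \{2, 31\}$.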

Prefix codes are closed with respect to concatenation, switch composition,
and left quotient by a word. I.e., if $E_1, E_2, \ldots, E_m$ are prefix codes
and $w$ is a word then $E_1 E_2$, $[E_1 : E_2 : \ldots : E_m]$, and $w^{-1}E_1$
are prefix codes.
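
For finite languages these operations are easy to experiment with. The following is a minimal sketch, assuming words are Python strings over the digits 1..m (so m <= 9) and languages are Python sets; all function names are hypothetical and not taken from the paper.

```python
def concat(E, F):
    """Concatenation EF of two finite languages."""
    return {u + v for u in E for v in F}

def switch(*langs):
    """Switch composition [E1 : ... : Em] = 1 E1 U 2 E2 U ... U m Em."""
    return {str(i) + w for i, E in enumerate(langs, start=1) for w in E}

def left_quotient(w, E):
    """Left quotient w^{-1} E = { v : wv in E }."""
    return {v[len(w):] for v in E if v.startswith(w)}

def is_prefix_code(E):
    """True iff no word of E is a proper prefix of another word of E."""
    return not any(u != v and v.startswith(u) for u in E for v in E)

# The example from the text: [0 : {eps} : {1}] = {2, 31}
print(switch(set(), {""}, {"1"}) == {"2", "31"})  # True
```

The empty string stands for the empty word, so `{""}` is the trivial code consisting of the empty word and `set()` is the empty language.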

Definition 3. A non-empty prefix code which cannot be represented by a
concatenation of two non-trivial prefix codes is called a prime prefix code (or
simply a prime).
Since non-empty prefix codes form a free monoid, we have the following
lemma.

Lemma 2. Every non-empty prefix code $E$ admits a unique decomposition,
$E = P_1 P_2 \cdots P_k$, into non-trivial primes.

Let $E$, $F$ be two prefix codes. We say that $G$ is a common right divisor of
$E$ and $F$, denoted by $G \in \mathrm{CD}(E,F)$, if and only if there exist two
prefix codes $E_1$, $F_1$ such that $E_1 G = E$ and $F_1 G = F$. Note that at
least one common right divisor always exists, namely $\{\varepsilon\}$. Also,
$\emptyset$ is a common right divisor of $E$ and $F$ if and only if
$E = F = \emptyset$. We say that $G \in \mathrm{CD}(E,F)$ is the greatest common
right divisor, $\mathrm{GCD}(E,F)$, if there is no other prefix code
$G' \in \mathrm{CD}(E,F)$ such that $G' = G''G$ for a prefix code $G''$. We have
$\mathrm{GCD}(E, \emptyset) = E$ for each prefix code $E$.

One can prove that the greatest common right divisor always exists and is
unique.

Theorem 1. Let $F_1, F_2, \ldots, F_m$ be prefix codes. Then
$[F_1 : \ldots : F_m]$ is a prime if and only if
$\mathrm{GCD}(F_1, \ldots, F_m) = \{\varepsilon\}$.

Proof. We prove the left-to-right implication by contradiction.

If $\mathrm{GCD}(F_1, \ldots, F_m) \neq \{\varepsilon\}$ then two cases are possible:

1. $\mathrm{GCD}(F_1, \ldots, F_m) = \emptyset$; then $[F_1 : \ldots : F_m] = \emptyset$.
This is not a prime.

2. $\mathrm{GCD}(F_1, \ldots, F_m) = F$ is a non-trivial prefix code. Thus,
$[F_1 : \ldots : F_m] = [F_1'F : \ldots : F_m'F] = [F_1' : \ldots : F_m']F$ for
some prefix codes $F_1', \ldots, F_m'$. This is not a prime.

The right-to-left implication is also proved by contradiction.

Suppose $[F_1 : \ldots : F_m]$ is not a prime, i.e., $[F_1 : \ldots : F_m] = FP$
for some non-trivial prefix codes $F$ and $P$. Thus,
$P \in \mathrm{CD}(F_1, \ldots, F_m)$ and hence
$\{\varepsilon\} \neq \mathrm{GCD}(F_1, \ldots, F_m)$. □

Since in data classification applications prefix codes are mostly generated
using concatenation and switch operations, the following observation is useful
for storing prefix codes as lists of prime factors. Let
$E, F, E_1, \ldots, E_m$ be prefix codes over an m-letter alphabet, and let
$\mathcal{PF}(F)$ denote the list of prime factors of $F$. We have

$$\mathcal{PF}(EF) = \mathcal{PF}(E) \cdot \mathcal{PF}(F) \qquad (2)$$

$$\mathcal{PF}([E_1 : \ldots : E_m]) = [E_1 G^{-1} : \ldots : E_m G^{-1}] \cdot \mathcal{PF}(G) \qquad (3)$$

where $G = \mathrm{GCD}(E_1, E_2, \ldots, E_m)$.

If two prefix codes $F_1$ and $F_2$ are given as lists of their prime factors,
finding their greatest common right divisor, $\mathrm{GCD}(F_1, F_2)$, is
straightforward and consists in identifying the longest common suffix of $F_1$
and $F_2$ seen as words over primes.

If $\mathcal{PF}(F_1) = E_1 E_2 \ldots E_k$ and
$\mathcal{PF}(F_2) = G_1 G_2 \ldots G_r$, then
$\mathrm{GCD}(F_1, F_2) = E_i E_{i+1} \ldots E_k$, where $i$ is the smallest
integer such that $E_i E_{i+1} \ldots E_k$ is a suffix of $G_1 G_2 \ldots G_r$.
If there is no such $i$ then $\mathrm{GCD}(F_1, F_2) = \{\varepsilon\}$.
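
The longest-common-suffix computation can be sketched as follows. This is an illustration under the assumption that each prime factor is represented by a comparable label (here a string) and a prefix code by the Python list of its factors; the function name `gcd_factors` is hypothetical.

```python
def gcd_factors(pf1, pf2):
    """Longest common suffix of two prime-factor lists.

    The empty list stands for the trivial code consisting of the empty word.
    """
    n = 0
    while n < len(pf1) and n < len(pf2) and pf1[-1 - n] == pf2[-1 - n]:
        n += 1
    return pf1[len(pf1) - n:]

print(gcd_factors(["A", "B", "C"], ["D", "B", "C"]))  # ['B', 'C']
print(gcd_factors(["A"], ["B"]))                      # []
```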




Prime Decompositions of Regular Prefix Codes 






3 Prime Decomposition of Regular Prefix Codes 

From the results of the previous section it is relatively easy to find the prime
decomposition of a finite prefix code. In order to find an efficient and general
algorithm for decomposition of all regular prefix codes we will introduce below
an approach based on so-called D-articulation states. In [l] the general problem
of decomposition of languages into a concatenation of other languages was
addressed. It has been proved that every decomposition $A = B \cdot C$ of a
regular language $A$ ($B$ and $C$ not necessarily regular) implies a
decomposition $A = B' \cdot C'$ such that $B'$ and $C'$ are regular with
$B \subseteq B'$ and $C \subseteq C'$.

Theorem 2. Let $E$, $F$, $G$ be non-empty prefix codes such that $E = FG$. $E$ is
regular if and only if $F$ and $G$ are.

Proof. Since regular sets are closed with respect to concatenation, one
implication is obvious.

Let $Q(A) = \{w^{-1}A \mid w \in \Sigma^*\} \setminus \{\emptyset\}$, for any
language $A$. The set $Q(A)$ is exactly the set of states of the minimal
deterministic (possibly infinite) automaton for $A$, see e.g.,
[9, Theorem 8.1] or [10].

If $E = FG$ with $E$, $F$, $G$ being non-empty prefix codes, then there is an
injection $i : Q(F) \to Q(E)$, e.g., $i(w^{-1}F) = w^{-1}E = w^{-1}(FG)$. If $E$
is regular, then $Q(E)$ is finite. Because of the existence of $i$, $Q(F)$ must
be finite, i.e., $F$ is regular. On the other hand, $G$ is regular since
$G = F^{-1}E$, see 0 Proposition 3.1] or [in]. □

In order to determine the primality of a prefix code $F$ we will need to look for
D-articulation states of the min-dfa $A_F$ accepting $F$.

Definition 4. Let $A_F = (Q, \delta, i, \{t\})$ be the min-dfa recognizing a
non-empty prefix code $F$. A state $s \in Q$ is called a D-articulation state if
every path from $i$ to $t$ passes through $s$.

Lemma 3. A D-articulation state of a deterministic finite automaton A re- 
mains a D-articulation state after minimization of A. 

In the remainder of this section we will show how the prime decomposition 
problem of a prefix code may be efficiently solved through the determination of 
D-articulation states of its min-dfa. 

To determine the prime decomposition of a prefix code $F$ we suppose that it
is given as its min-dfa $A_F$. We will suppose that its size (i.e., its number of
states) is equal to $n$. The decomposition algorithm will find a sequence of
prefix codes $F_1, F_2, \ldots, F_k$ such that $F = F_1 F_2 \ldots F_k$. Each
prefix code $F_i$, for $i \in [1, k]$, will be reported by the algorithm as a dfa
$A_{F_i}$. As in some cases the sum of sizes of $A_{F_i}$, for $i \in [1, k]$,
may be as high as $\Omega(n^2)$, cf. Example 1 at the end of Section 4, it would
seem to be the lower bound for the time complexity of such an algorithm.
However, each $A_{F_i}$ may be obtained as a so-called projection of $A_F$. A
projection of $A_F$ will have the same set of states but the initial and final
states may be different. This will allow us to construct an $O(n)$ time
algorithm reporting all factors.
Definition 5. Let $A = (Q, \delta, i, \{t\})$ be a dfa accepting a prefix code
$F$, and let $s$, $s'$ be two states from $Q$. We denote by $A(s, s')$ the finite
automaton having the same states as $A$, with $s$ designated as the initial
state and $s'$ as the only terminal state, such that all the outgoing
transitions of $s'$ have been discarded.

More formally, $A(s, s') = (Q, \delta', s, \{s'\})$, where, for all $x \in Q$,
$a \in \Sigma$:

$$\delta'(x, a) = \begin{cases} \text{undefined} & \text{if } x = s' \\ \delta(x, a) & \text{otherwise} \end{cases}$$

Note that after replacing $i$ by $s$ and $t$ by $s'$ we get a dfa, but not
minimal, in general, since it may contain useless (sink) and unreachable states.
In particular, if $s' \neq t$, $t$ becomes a sink state as the new final state
$s'$ is unreachable from $t$. Any such projection produces a prefix code
automaton.
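
A projection can be sketched directly from this definition. The code below is an illustration only: it assumes the transition function is stored as a Python dict keyed by (state, letter), and the small four-state automaton and the helper names `projection` and `words` are hypothetical.

```python
def projection(delta, s, s_prime):
    """A(s, s'): same states, initial state s, final state s',
    with every outgoing transition of s' discarded."""
    new_delta = {(x, a): y for (x, a), y in delta.items() if x != s_prime}
    return new_delta, s, s_prime

def words(delta, start, final, max_len, alphabet="12"):
    """Brute force: all accepted words of length at most max_len."""
    found, frontier = set(), [("", start)]
    while frontier:
        w, q = frontier.pop()
        if q == final:
            found.add(w)
        if len(w) < max_len:
            for a in alphabet:
                if (q, a) in delta:
                    frontier.append((w + a, delta[(q, a)]))
    return found

# Hypothetical dfa with states 1..4, initial state 1, final state 4.
delta = {(1, "1"): 2, (2, "1"): 3, (2, "2"): 1, (3, "1"): 4, (3, "2"): 1}
print(sorted(words(*projection(delta, 2, 3), 5)))  # ['1', '211', '21211']
```

The projection shares the transition table of the original automaton apart from the discarded transitions, which is why a collection of projections can be described compactly by pairs of states.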

To show the correctness of the primality testing algorithm we first prove the 
following. 

Theorem 3. A prefix code F is prime if and only if its min-dfa has no D- 
articulation state. 



Proof. To prove the right-to-left implication we suppose that $F = F_1 F_2$ and
we will conclude that $A_F$ must have a D-articulation state. Let $i_j$ and $t_j$
be the initial and terminal states of the prefix code automaton $A_{F_j}$,
$j = 1, 2$, respectively. The construction of $A_F$ is the following.

Let $A'$ be a dfa whose set of states is the union of the states of $A_{F_1}$ and
$A_{F_2}$, where $t_1$ and $i_2$ are identified as the same state. The set of
transitions of $A'$ is the union of the transition sets of $A_{F_1}$ and
$A_{F_2}$. Finally we take as its initial state $i = i_1$ and its terminal state
$t = t_2$. It is obvious that $A'$ accepts $F_1 F_2$. Note that the automaton is
actually deterministic and $A_F = \min(A')$ is the automaton for $F$. By
construction $t_1$ $(= i_2)$ is a D-articulation state in $A'$ and thus also in
$A_F$ (Lemma 3).

For the left-to-right implication we suppose that $a$ is a D-articulation state
of $A_F = (Q, \delta, i, \{t\})$ and we will prove that $F$ is not a prime.
Consider $A_{F_1} = A_F(i, a)$ and $A_{F_2} = A_F(a, t)$. Note that $A_{F_2}$ may
have a loop on state $a$ but $A_{F_1}$ may not. Take any word $w \in F$ and the
process of its acceptance by $A_F$. Let $w = xy$, where $x$ is the prefix of $w$
such that on reading the last symbol of $x$ we visit state $a$ for the first
time. Clearly, $x$ is accepted by $A_{F_1}$ and $y$ is accepted by $A_{F_2}$.
Thus $F$ is a concatenation of two prefix codes defined by the automata
$A_F(i, a)$ and $A_F(a, t)$. □

The following lemma follows directly from the definition of the D-articulation 
state. 

Lemma 4. Let $A = (Q, \delta, i, \{t\})$ be a dfa with at least two different
D-articulation states $a_1$ and $a_2$. Then one of the two states, say $a_1$, is
such that on every path from $i$ to $t$ in $A$ the first occurrence of $a_1$
always precedes the first occurrence of $a_2$. Moreover, on each of these paths,
the last occurrence of $a_1$ precedes the last occurrence of $a_2$.






The sequence of D-articulation states of a prefix code automaton $A$ will be
called a spine of $A$. Lemma 4 yields an ordering on the states of the spine of
$A$. Thus we will write $a_1 \prec a_2$ if on every complete path in $A$, the
first occurrence of $a_1$ precedes the first occurrence of $a_2$ (and, as a
consequence, the same happens to the last occurrences of both states).

Theorem 4. Let $A_F$ be the min-dfa for a prefix code $F$, having $k$
D-articulation states. Then $F$ is decomposable into $k + 1$ primes,
$F = F_1 F_2 \ldots F_{k+1}$.

Proof. Let $\{a_1, \ldots, a_k\}$ be the set of D-articulation states of
$A_F = (Q, \delta, i, \{t\})$ ordered by the $\prec$ relation. The proof goes by
induction on $k$. It is sufficient to prove that $F = F'F''$, such that $F'$ is
accepted by $A_{F'}$ with $k - 1$ D-articulation states and $F''$ is a prime
prefix code. Let $A_{F'} = A_F(i, a_k)$ and $A_{F''} = A_F(a_k, t)$.

By construction, $F = F'F''$. Observe that, as $a_k$ is the last D-articulation
state, by Lemma 4 there is no other state which belongs to every complete
path in $A_{F''}$. Thus, by Theorem 3, $F''$ is prime. We show that $A_{F'}$ has
$k - 1$ D-articulation states. By construction of $A_{F'}$, all paths in
$A_{F'}$ are exactly sub-paths in $A_F$ from $i$ to the first occurrence of $a_k$.

However, by Lemma 4, on every path in $A_F$, before the first occurrence of
$a_k$ there was the first occurrence of each state $a_j$ for
$1 \le j \le k - 1$. Thus each state $a_j$, $1 \le j \le k - 1$, is a
D-articulation state in $A_{F'}$ and no other state in $A_{F'}$ is a
D-articulation state. □



It follows from Theorems 3 and 4 that finding the prime decomposition of a
prefix code is equivalent to determining the D-articulation states of its min-dfa.
Hence we concentrate on the latter task in the next section.



4 Finding D-Articulation States



In this section we present an algorithm which finds all D-articulation states in
a min-dfa $A$ of a prefix code, in time linear in the size of $A$. Let $G$ be an
n-node directed graph corresponding to a min-dfa $(Q, \delta, i, \{t\})$. (Recall
that we consider bounded size alphabets, hence the number of edges in $G$ is
$O(n)$.) We assume that every node of $G$ is on some directed path from $i$ to $t$.



Definition 6. Let $v$ be a node and $\pi = (p_1, \ldots, p_k)$ be a simple
directed path in $G$. Define FIRST[$v, \pi$] to be the smallest index $j$ for
which there is a directed path from $p_j$ to $v$ node-disjoint with $\pi$.
Similarly define LAST[$v, \pi$] to be the largest $r$ for which there is a
directed path from $v$ to $p_r$ node-disjoint with $\pi$.



We first give a general overview of the algorithm. 
Algorithm Find-D-articulation-states

1. Find any simple directed path $\pi = (p_1, \ldots, p_k)$ from $i$ to $t$.
   Call this path the backbone.
   { Use Depth-First-Search El on G starting from i. }

2. For each node $v \notin \pi$ compute FIRST[$v, \pi$] and LAST[$v, \pi$].

3. For each node $v \notin \pi$ add the edge $(p_j, p_m)$ to the graph $G$,
   where $j$ = FIRST[$v, \pi$] and $m$ = LAST[$v, \pi$].

4. Output D-articulation states of $G$ as those nodes $p_l \in \pi$ for
   which there is no arc $p_j p_m$ with $j < l < m$.

We now give a more detailed description of the algorithm, show that it runs 
in linear time, and prove its correctness. 

Computation of left and right labels. 

Label all nodes of $\pi = (p_1, \ldots, p_k)$ with 0. Initially, all other nodes
are unlabeled. We define the following procedure Modified-DFS(s), for
$s \in [1, k]$. It is a version of DFS which starts from $p_s$ using edges
outside $\pi$. It backtracks whenever a labeled node is encountered. It assigns
the label $s$ to all newly visited nodes which were previously not labeled. Call
the label assigned in this way to a node $v$ “the left label of $v$”.

The whole algorithm computing left labels for all nodes outside the backbone
is as follows:

for s = 1 to k do Modified-DFS(s).

Similarly we compute right labels, processing the reverse graph of $G$ (where
all edges are reversed) in the reverse order, i.e., from $s = k$ down to 1.



Lemma 5. For every node $v$, its left label is equal to FIRST[$v, \pi$] and its
right label is equal to LAST[$v, \pi$]. Moreover, by the algorithm above, the
values FIRST[$v, \pi$] and LAST[$v, \pi$] are computed in linear time.

Proof. By construction, node $v$ obtains left label $l$ if there exists a path
from $p_l$ to $v$ outside of the backbone, and there is no such path starting at
$p_j$, for $j < l$. Hence, $l$ = FIRST[$v, \pi$]. A similar argument holds for
LAST[$v, \pi$] and the right label of $v$. The algorithm takes linear time since
each edge is processed at most twice. □

Implementation of Part 4 of the main algorithm. 

At the beginning of Part 4 we have the graph augmented by the arcs added in
Part 3. Define an integer vector $A$, of length $k$, as follows. $A[j]$ is the
highest index of a node of the backbone $\pi = (p_1, \ldots, p_k)$ which is the
end of an arc starting in $p_j$. (We consider edges of the backbone as arcs, so
if no arc starting from $p_j$ was added, we have $A[j] = j + 1$.) Computing $A$
takes time linear in the number of added arcs, i.e., $O(n)$. Now compute the
vector $B$, where $B[j] = \max(A[1], \ldots, A[j - 1])$. This vector can be
computed in time $O(k)$. $B[j] = r$ means that the highest index of a
destination node of an arc starting in $p_s$, for $s < j$, is $r$. Hence
$B[j] = j$ iff $p_j$ is a D-articulation state. (Clearly nodes outside of the
backbone cannot be D-articulation states.)

It follows that the entire algorithm runs in linear time in n. The correctness 
of the algorithm follows from the following lemma. 

Lemma 6. A node is a D-articulation state iff it is one of the backbone nodes
$p_l$ and there is no edge $p_j p_m$ with $j < l < m$ at the end of the algorithm.

Proof. Suppose that $v$ is a D-articulation state. Then $v$ must belong to the
backbone, for otherwise the backbone would be a path from $i$ to $t$ omitting
$v$. Let $v = p_l$ and suppose, for a contradiction, that there is an arc
$p_j p_m$ with $j < l < m$. This arc existed in the original graph or has been
added in Part 3. In both cases it follows that there is a path from $i$ to $t$
which omits $v$. This contradicts the definition of a D-articulation state.

Now suppose that $v$ is not a D-articulation state and let $R$ be a path from
$i$ to $t$ that does not contain $v$. Moreover, suppose that $v$ belongs to the
backbone, and $v = p_l$. Let $j$ be the largest index smaller than $l$ such that
$p_j$ belongs to $R$, and let $m$ be the smallest index larger than $l$ such that
$p_m$ belongs to $R$. If there is no edge $p_j p_m$ in the original graph then
the segment of the path $R$ between $p_j$ and $p_m$ has length at least 2. Let
$u$ be any node in this segment. In this case the arc $p_j p_m$ was added in
Part 3 when considering node $u$. Hence the arc $p_j p_m$ is present upon the
completion of the algorithm. □

This implies the following result. 

Theorem 5. Let $A_F$ be an n-state min-dfa accepting a prefix code $F$. The
sequence of all D-articulation states of $A_F$ may be found in $O(n)$ time.
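
The four parts of the algorithm can be put together in a short sketch. It is an illustration under stated assumptions, not the authors' implementation: the graph is a Python dict mapping each node to a list of successors, t is assumed reachable from i, indices are 0-based, and the function name and the vector name `reach` (playing the role of A) are hypothetical. Following Step 4 literally, the endpoints i and t are reported as well.

```python
def d_articulation_states(edges, i, t):
    """Backbone nodes that no original or added arc jumps over."""
    # Part 1: any simple path from i to t (iterative DFS with parent links).
    parent, stack = {i: None}, [i]
    while stack:
        u = stack.pop()
        if u == t:
            break
        for v in edges.get(u, []):
            if v not in parent:
                parent[v] = u
                stack.append(v)
    path, u = [], t
    while u is not None:
        path.append(u)
        u = parent[u]
    path.reverse()
    index = {p: j for j, p in enumerate(path)}
    k = len(path)

    # Part 2: Modified-DFS from each backbone node, never entering the
    # backbone and never relabeling a node.
    def labels(graph, order):
        lab = {}
        for s in order:
            todo = [path[s]]
            while todo:
                x = todo.pop()
                for y in graph.get(x, []):
                    if y not in index and y not in lab:
                        lab[y] = s
                        todo.append(y)
        return lab

    rev = {}
    for x, ys in edges.items():
        for y in ys:
            rev.setdefault(y, []).append(x)
    first = labels(edges, range(k))               # left labels (FIRST)
    last = labels(rev, range(k - 1, -1, -1))      # right labels (LAST)

    # Part 3: reach[j] = highest backbone index ending an arc from p_j.
    reach = [min(j + 1, k - 1) for j in range(k)]
    for j, p in enumerate(path):
        for y in edges.get(p, []):
            if y in index:
                reach[j] = max(reach[j], index[y])
    for v, j in first.items():
        if v in last:
            reach[j] = max(reach[j], last[v])

    # Part 4: p_l qualifies iff no arc p_j p_m has j < l < m.
    result, best = [], 0
    for l in range(k):
        if best <= l:
            result.append(path[l])
        best = max(best, reach[l])
    return result

diamond = {"i": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
print(d_articulation_states(diamond, "i", "t"))  # ['i', 't']
```

Each edge is examined a constant number of times, in line with the linear bound of Theorem 5.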



Example 1. Consider the following prefix code $F$ over $\Sigma = \{1, 2\}$,

$$F = (12 + 1^2 2 + \ldots + 1^{n-2} 2)^* 1^{n-1}$$

for some $n > 1$. Its min-dfa has $n$ states as shown in Fig. 1. Each of its
states is a D-articulation state. According to Theorem 4,
$F = F_1 F_2 \ldots F_{n-1}$, where $A_{F_i} = A_F(s_i, s_{i+1})$, for
$i \in [1, n - 1]$. Note that $A_{F_i}$ contains as its set of states
$\{s_1, s_2, \ldots, s_{i+1}\}$ and it is minimal. The sum of sizes of $A_{F_i}$,
over $i \in [1, n - 1]$, is $2 + 3 + \ldots + n = \Omega(n^2)$. The prefix code
$F$ is decomposed into $n - 1$ primes given by the following formula:

$$F_1 = 1 \quad \text{and} \quad F_i = (2 \bar{F}_{i-1})^* 1, \text{ for } i \in [2, n - 1]$$

where $\bar{F}_j = F_1 F_2 \cdots F_j$.

For example, for $n = 4$ we have the prime decomposition:

$$\mathcal{PF}((12 + 1^2 2)^* 1^3) = 1 \cdot (21)^* 1 \cdot (21(21)^* 1)^* 1$$
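
The case n = 4 can be checked mechanically on short words. The sketch below assumes Python's `re` module as the regular expression engine (not part of the paper) and compares the code with the concatenation of its three factors on every word over {1, 2} up to a given length; the names are hypothetical.

```python
import re
from itertools import product

# F for n = 4 and the concatenation of its three prime factors.
F = re.compile(r"(?:12|112)*111")
DECOMP = re.compile(r"1(?:21)*1(?:21(?:21)*1)*1")

def agree_up_to(max_len):
    """True iff both expressions accept the same words of length <= max_len."""
    for n in range(max_len + 1):
        for letters in product("12", repeat=n):
            w = "".join(letters)
            if (F.fullmatch(w) is None) != (DECOMP.fullmatch(w) is None):
                return False
    return True

print(agree_up_to(12))  # True
```

A bounded check of this kind is only a sanity test; equality of the two languages follows from Theorem 4.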
Fig. 1. A prefix code min-dfa all of whose states are D-articulation states



From the proof of Theorem 2 it follows that the size of a prefix code,
expressed as the number of states of its min-dfa, is at least as large as the
size of each of its prime components. However, the sum of sizes of prime
components may be substantially larger than the size of the original prefix
code. Example 1 shows that for a prefix code of size $n$, the sum of sizes of its
prime components may be $\Omega(n^2)$. Nevertheless, our algorithm allows all the
prime components to be represented collectively within an $O(n)$ data structure.
Theorems 3, 4 and 5 imply the following result.

Theorem 6. A prime decomposition of any prefix code accepted by an n-state
min-dfa can be found in time $O(n)$ and be represented in $O(n)$ space.
Abstract. Finite-state transducers can be used to map a language onto
a set of values. This paper proposes an alternate representation method 
for such a mapping, consisting of associating a finite-state automaton 
accepting the input language with a decision tree representing the output 
values. The advantages of this approach are that it leads to more compact 
representations than transducers, and that decision trees can easily be 
synthesized by machine learning techniques. 



1 Introduction 

For the application of large-scale dictionaries two major problems have to be
solved: fast lookup speed and compact representation. Using automata we can
achieve fast lookup by determinization and compact representation by
minimization. For providing information for the recognized words one can use
transducers (i.e., automata with outputs) |1 S|1 ,5|1 f)l |. The goal of this
work is to propose a competitor to transducers. Our method combines automata and
machine learning theories with the following desired properties:

1. The number of states (and hence of transitions) representing the input
language is smaller in our method than with transducers.

2. In constructing transducers, we have to represent every transition by a data
structure of at least two fields: one for the symbol representing the transition,
another for the label-value (for short, label) associated with the symbol. So,
in order to properly calculate the outputs, the label set needs to have an
algebraic structure, e.g., a semiring in the case of weighted automata Ell.
In our approach the transitions are not labeled with outputs; the cost of
exploring the automata is low.

3. In most applications (e.g., those using part-of-speech tagging) there may
be (many) identical output values. With transducers there is no way to save
space on such identical information, whereas our approach allows such economy.

In order to explain intuitively the benefits of our method, we give a very
simple example in the following. Let V = {Asia, Europa} be the output values
of the following three countries: K = {Iran, Iraq, Ireland}. In order to
determine the output value of any element of K one can learn a decision tree
based on the mutual information: if the key (of K) ends with ’n’ or ’q’ then
retrieve ’Asia’, else ’Europa’.

Our solution is to represent the keys in a finite-state automaton and the
output values in a decision tree. Our method has been applied to
a number of applications: Who is Who? text classification |7j, processing 
the nationality words and their tagging Enni along with the discussions with 
respect to (w.r.t.) two trie p[j, ternary search [2j and perfect hashing 011- 

In [H] we just briefly sketched our idea without giving the algorithms, which
are described in Section 3 of the present paper. In Section 2, we recall some
basic notions of both the automata theory and the machine learning used in our 
algorithms. The experiments and comparisons of our method w.r.t. the trans- 
ducers are outlined in Section 4. 



2 Preliminary Considerations 



For more information on automata we refer the reader to [19]. For a general
reference on machine learning, the reader is referred to [14].

Recall that an acyclic finite-state automaton is a graph of the form
$g = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$
is the alphabet, $q_0$ is the start state, and $F \subseteq Q$ is the set of
accepting states. $\delta$ is a partial mapping
$\delta : Q \times \Sigma \to Q$ denoting transitions. If $a \in \Sigma$, the
notation $\delta(q, a) = \bot$ is used to mean that $\delta(q, a)$ is undefined.
Let $\Sigma^*$ denote the set containing all strings over $\Sigma$ including the
zero-length string, called the empty string $\varepsilon$. The extension of the
partial mapping $\delta$ to $x \in \Sigma^*$ is a function
$\delta^* : Q \times \Sigma^* \to Q$ defined as follows:

$$\delta^*(q, \varepsilon) = q$$

$$\delta^*(q, ax) = \begin{cases} \delta^*(\delta(q, a), x) & \text{if } \delta(q, a) \neq \bot \\ \bot & \text{otherwise.} \end{cases}$$

A finite automaton is said to be an (n,m)-automaton if $|Q| = n$ and $|E| = m$
where $E$ denotes the set of the edges (transitions) of $g$. The mapping
$\delta^*$ allows fast retrieval for variable-length strings and quick detection
of unsuccessful searches. The worst-case time complexity of $\delta^*$ is $O(n)$
w.r.t. a string of length $n$.
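
The extended transition function translates into a simple loop. The following is a minimal sketch: the dict-based transition table, the use of `None` for the undefined value, and the trie construction (an acyclic but not necessarily minimal automaton) are assumptions made for illustration, and the function names are hypothetical.

```python
def delta_star(delta, q, x):
    """Follow x from state q; return None as soon as a transition is undefined."""
    for a in x:
        q = delta.get((q, a))
        if q is None:
            return None
    return q

def build_trie(keys):
    """Acyclic deterministic automaton (a trie) accepting exactly the keys."""
    delta, finals, next_state = {}, set(), 1
    for key in keys:
        q = 0  # start state
        for a in key:
            if (q, a) not in delta:
                delta[(q, a)] = next_state
                next_state += 1
            q = delta[(q, a)]
        finals.add(q)
    return delta, finals

delta, finals = build_trie(["Iran", "Iraq", "Ireland"])
print(delta_star(delta, 0, "Iraq") in finals)   # True
print(delta_star(delta, 0, "Italy"))            # None
```

The loop stops at the first undefined transition, which is the quick detection of unsuccessful searches mentioned above.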

A decision tree (dt) is a directed acyclic graph of nodes and arcs. At each node
a simple test is made; at the leaves a decision is made with respect to the class
labels (values in our case). The dt was introduced in the machine learning (ML)
community m-. The suitable input for the classification algorithm in ML is a
list of a fixed number of attributes of the class at hand and their values, as
shown in the left part of Table 1 for 10 restaurants using four attributes. At
the top of this tree, expressed by way of three rules in the right part of
Table 1, one can see the attribute age; this indicates that it is most likely
that a decision can be made quickly if one first asks for the age of a
restaurant. If the answer to this question is ‘new’ or ‘old’, then the profit can
be predicted to be ‘up’ or ‘down’, respectively. If the answer is ‘midlife’, then
another question must be posed, about the presence of competition. After this
answer is known, the profit trend can be determined.
Table 1. The profit trends of 10 Slow-and-Fast Foods and its decision tree

Profit  Age      CP(a)  Type  |  Test                         Result
------  -------  -----  ----  |  ---------------------------  ------------------------
down    old      no     CK    |  If Age is new                Then Profit is up.
down    midlife  yes    CK    |  If Age is old                Then Profit is down.
up      midlife  no     HB    |  If Age is midlife            Then If ∃ Competition
down    old      no     HB    |                                 Then Profit is down.
up      new      no     HB    |                                 Else Profit is up.
up      new      no     CK    |  Entropy(profit,age)          = 0.4d0
up      midlife  no     CK    |  Entropy(profit,CP)           = 0.8754887502163469d0
up      new      yes    CK    |  Entropy(profit,type)         = 1.0d0
down    midlife  yes    HB    |  Prob(profit,up,age,new)      = 1.0d0
down    old      yes    CK    |  Prob(profit,up,age,midlife)  = 0.5d0

(a) CP (competition), HB (hamburger) and CK (Chelo-Kabab: Iranian hum).



Table 2. Backward attribute-based Data and Decision Tree.

b7 b6 b5 b4 b3 b2 b1  KV        Solution-Path        Question   KV
*  *  *  I  r  a  n   Tehran    (b1 n kv Tehran)     b1 = n?    Tehran
*  *  *  I  r  a  q   Baghdad   (b1 q kv Baghdad)    b1 = q?    Baghdad
I  r  e  l  a  n  d   Dublin    (b1 d kv Dublin)     b1 = d?    Dublin



3 Algorithm 

We refer to a key as a sequence of characters surrounded by empty spaces but
containing no internal space. We use key and word interchangeably. We write
$b_i$ to denote the $i$-th character (from right to left) of a key, e.g., $b_1$ and $b_5$ of ’Iran’
are ’n’ and the null character (shown by * for convenience), respectively. A key-value
(or output value, denoted kv) is also a sequence of characters surrounded
by empty spaces, which may have one or more internal spaces. For instance, one
may assign to a word a unique ambiguity class (e.g., “Adj/Noun”), although a
class represents a set of alternative key-values that a given word can occur with.
Input: The user-file is of the following customary form: $f = \{(k_i, v_i) \mid i = 1, 2, \ldots, p\}$,
where each $(k_i, v_i)$ represents a pair with $k_i$ and $v_i$ standing for
a key and a key-value, respectively.

Outputs: The (n,m)-automaton and the decision tree. Table 2 shows an example
of the learned dt w.r.t. $f_1$ = {(Iran, Tehran), (Iraq, Baghdad), (Ireland, Dublin)}.
Using outputs: Given a string input x, if it can be spelled out using g, i.e.,
$\delta^*(q_0, x) \in F$, then dt is used to search kv. For instance, kv of ’Iran’ w.r.t. $f_1$ is
’Tehran’ because the first solution-path of the learned dt of Table 2 provides
this result. Note that the dt w.r.t. $f_2$ = {(Iran, Asia), (Iraq, Asia)} has a
unique solution-path, i.e., (kv Asia): no condition (i.e., question) is required to
discriminate the key-value.
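
To make the backward attributes concrete, the following Python sketch builds the data table for a small set of pairs, padding short keys with the star-symbol. The function names `select_character` and `form_input_for_learning` echo the names used in the text, but their signatures and the list-based table layout are our own assumptions.

```python
def select_character(key, j):
    """Return b_j, the j-th character of key counted from the right, or '*' if key is shorter."""
    return key[-j] if j <= len(key) else "*"


def form_input_for_learning(pairs):
    """Build (header, rows): columns b_l ... b_1 followed by the target attribute kv."""
    longest = max(len(key) for key, _ in pairs)
    header = [f"b{j}" for j in range(longest, 0, -1)] + ["kv"]
    rows = [
        [select_character(key, j) for j in range(longest, 0, -1)] + [value]
        for key, value in pairs
    ]
    return header, rows


F1 = [("Iran", "Tehran"), ("Iraq", "Baghdad"), ("Ireland", "Dublin")]
header, rows = form_input_for_learning(F1)
for row in [header] + rows:
    print(" ".join(row))
```

For the three pairs above, the printed rows correspond to the left part of Table 2.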
3.1 Main Algorithm

Preprocessing: First, by way of the function FormAutomaton we form g
accepting the input language (K, the set of $k_i$ of f). Then, we call the function
FormInputForLearning with two input parameters: $\ell$ (i.e., the length
of the longest key, calculated by FormAutomaton) and f. The output is the
suitable data table of $p \times (\ell + 1)$ elements, namely table, to be used as
training examples for learning the key-values. Finally, we learn how the output
values can be synthesized by dt. This is done by the function Classify, using
table and kv, where kv denotes the key-value. So, kv is the target attribute of
the learning process of this phase.

Algorithm 1: Construction from f = {(k_i, v_i) | i = 1, 2, ..., p} and its utilization.

func FormDictionary(f) // Preprocessing
    ℓ ← 0; {ℓ is the global variable standing for the length of the longest key.}
    g ← FormAutomaton(f); {Outputs: ℓ and an (n,m)-automaton.}
    table ← FormInputForLearning(f, ℓ); {Output: training samples.}
    dt ← Classify(kv, table); {kv: target attribute. Output: decision tree.}
cnuf

func UsingDictionary(x) // Processing. x is a variable-length string input.
    if δ*(q0, x) = q such that q ∈ F then
        kv ← SearchValue(x, dt); {See Table 2.}
    else
        kv ← nil; {x is unknown.}
    end if
cnuf

Processing: It works as follows: if x can be spelled out using g, then the function
SearchValue uses the learned dt to output the key-value. SearchValue
examines the current solution-path and stops the search when it succeeds. This
function performs no backtracking in its search. Once it selects an attribute to
test at a particular level in the tree, it never backtracks to reconsider this choice,
because there are no ’missing values’ (of the attributes) in our method.
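
A possible reading of the processing step is sketched below in Python. We assume, purely for illustration, that the learned dt is stored as a list of solution-paths, each a pair (tests, kv) where tests is a list of (j, character) conditions on $b_j$; the name `search_value` and this representation are ours, not the paper's.

```python
def select(x, j):
    """b_j of x: the j-th character from the right, or '*' when x is shorter than j."""
    return x[-j] if j <= len(x) else "*"


def search_value(x, paths):
    """Return the key-value of the first solution-path whose tests all hold for x."""
    for tests, kv in paths:
        if all(select(x, j) == c for j, c in tests):
            return kv  # stop at the first success, no backtracking
    return None


# Solution-paths of Table 2, e.g. (b1 n kv Tehran) becomes ([(1, "n")], "Tehran").
PATHS_F1 = [([(1, "n")], "Tehran"), ([(1, "q")], "Baghdad"), ([(1, "d")], "Dublin")]
# The unique path (kv Asia) has no test at all.
PATHS_F2 = [([], "Asia")]
```

The sketch presupposes that x has already been accepted by the automaton, as in UsingDictionary.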



3.2 FormAutomaton() - Construction from Keys 

J. Daciuk, S. Mihov, B. W. Watson and R. E. Watson [5] describe
elegant algorithms for the incremental construction of minimal acyclic
finite state automata and transducers from both sorted and unsorted data. We
adapted the former (for sorted data) so that the length of the longest key is calculated
for later use in the construction of suitable input for learning the dt (see Section 3.3).
Algorithm 2: Construction from sorted keys (K) (adapted from [5]).

func FormAutomaton() // Input: K - the set of sorted keys.
    Register ← ∅; ℓ ← 0 {ℓ stands for the length of the longest key.}
    for key ∈ K do
        lk ← Length(key); ℓ ← Max(lk, ℓ); cp ← CommonPrefix(key);
        ls ← δ*(q0, cp); cs ← key[Length(cp) + 1 ... lk]; {ls: Last state; cs: Current suffix}
        if HasChildren(ls) then ReplaceOrRegister(ls); fi
        AddSuffix(ls, cs); {Creates a branch extending out of the dictionary.}
    end for
    ReplaceOrRegister(q0);
cnuf // Pessimistic time complexity is O(p × ℓ × log n).

func CommonPrefix(key) // The longest prefix of the word to be added.
    return key[1 ... a] : a = max i : ∃q ∈ Q, δ*(q0, key[1 ... i]) = q
cnuf // Pessimistic time complexity is O(ℓ).

func ReplaceOrRegister(state) // Executes at most O(ℓ) times for each key.
    child ← LastChild(state); {Returns the outgoing transition most recently added.}
    if not MarkedAsRegister(child) then
        if HasChildren(child) then ReplaceOrRegister(child); fi
        if ∃q ∈ Register (q ≡ child) {Same equivalence class - see [5].} then
            DeleteBranch(child); LastChild(state) ← q;
        else
            Register ← Register ∪ {child}; MarkedAsRegister(child);
        end if // Memory for the register of states is proportional to n.
    end if
cnuf // Pessimistic time complexity of adding a state to a register is O(ℓ).
Please refer to [5] for additional details.
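
The construction from sorted keys can be sketched in Python as follows. This is a simplified illustration in the spirit of the sorted-data algorithm of [5], not the authors' code: the class `State`, the dictionary-based register keyed by a state signature, and the helper `accepts` are our own assumptions, and the input list is assumed to be sorted.

```python
class State:
    def __init__(self):
        self.final = False
        self.edges = {}  # label -> State; insertion order follows the sorted keys

    def signature(self):
        # Two states are equivalent if finality and (label, target) pairs coincide.
        return (self.final, tuple((c, id(t)) for c, t in self.edges.items()))


def form_automaton(sorted_keys):
    """Return (initial state, length of the longest key) of the minimal acyclic automaton."""
    root, register, longest = State(), {}, 0

    def replace_or_register(state):
        label, child = next(reversed(state.edges.items()))  # most recently added edge
        if child.edges:
            replace_or_register(child)
        sig = child.signature()
        if sig in register:
            state.edges[label] = register[sig]  # redirect to the equivalent state
        else:
            register[sig] = child

    for key in sorted_keys:
        longest = max(longest, len(key))
        state, i = root, 0
        while i < len(key) and key[i] in state.edges:  # walk the common prefix
            state = state.edges[key[i]]
            i += 1
        if state.edges:
            replace_or_register(state)
        for c in key[i:]:  # add the current suffix
            state.edges[c] = State()
            state = state.edges[c]
        state.final = True
    if root.edges:
        replace_or_register(root)
    return root, longest


def accepts(root, word):
    state = root
    for c in word:
        state = state.edges.get(c)
        if state is None:
            return False
    return state.final
```

For the sorted keys Iran, Iraq, Ireland, the resulting automaton shares a single final state among all three words, and the returned length is 7.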

3.3 Learning the Output Values 

As the left part of Table 2 suggests, we have to construct the suitable input from
f, wherein the star-symbol in each column stands for the null characters
of those keys that are shorter than the longest key, namely ’Ireland’.
Note that the meaning of the star-symbol should not be confused with an unknown
attribute-value, which may cause an impossible classification. The function
FormInputForLearning generates the suitable input. Now we are ready to learn
the decision tree. This is done by the function Classify using the global input
variables: table and our target attribute, namely kv. If kv has only one value,
then the decision tree is trivial. Otherwise, we construct a tree with the chosen
splitting attribute as its root, with its descendants given by a recursive call
to Classify on the subtables with the splitting attribute taking each possible
value. A subtable is obtained by GetSubTable, which returns the subtable of the
original table consisting of the entries for which the given attribute takes the given
value. That is, in the new table, only the rows for which attribute takes value are
included, and the column for attribute is excluded.

Algorithm 3: Learning the output values from f = {(k_i, v_i) | i = 1, ..., p}

func FormInputForLearning() // Input: f = {(k_i, v_i) | i = 1, ..., p} and ℓ.
    table[1][ℓ + 1] = ’kv’; {kv is our target attribute.}
    for j from ℓ down to 1 {Naming ℓ first attributes.} do
        table[1][j] = concat(’b’, j);
    end for // First row of the data table contains the ℓ + 1 attributes.
    for i from 2 to p do
        table[i][j] ← SelectCharacter(key, j, ℓ), for j = 1, ..., ℓ; {An asterisk or the j-th
            character of the key depending on the length of the key and ℓ.}
        table[i][ℓ + 1] ← v_i;
    end for
cnuf // Output: table[p][ℓ + 1], e.g., Table 2.

func Classify() // Inputs: data table (i.e., table[p][ℓ + 1]) and kv.
    Φ ← GetPossibleValues(δ); {Returns a list of all the values that appear in the
        table for a given attribute (δ), duplicates deleted.}
    if |Φ| = 1 {Does kv have only one value? If so, call it v_k.} then
        InstallNodeOfTree(kv, v_k); {The leaf of the tree.}
    else
        b ← BestAttribute(); {Select the best one to test next.}
        for φ ∈ Φ do
            InstallNodeOfTree(b, φ); table ← GetSubTable(φ); Classify();
        end for
    end if
cnuf // Output: the decision tree.

func BestAttribute()
    Θ ← Attributes(); {Returns all attributes of the current data table except kv.}
    for θ ∈ Θ do
        e ← Entropy(θ); Collect (e, θ) into P_θ
    end for
    b ← Select attribute from P_θ with minimum entropy.
    return(b);
cnuf

func Entropy(θ) // θ stands for an attribute (e.g., b1)
    x ← NumberOfRows(); Φ ← GetPossibleValues(θ);
    for φ ∈ Φ do
        ψ ← ConditionedEntropy(θ, φ); y ← CountOccurrences(θ, φ);
        z = ψ × (y / x); Collect z into Z.
    end for
    e = Sum(Z); return(e);
cnuf

func ConditionedEntropy(attribute, value)
    S ← 0; V_k ← GetPossibleValues(kv);
    for v_k ∈ V_k do
        probability ← Prob(kv, v_k, attribute, value);
        S ← S + (probability × log2(probability));
    end for
    return(−S);
cnuf

func Prob(kv, v_k, attribute, value)
    v_1 ← CondCount(kv, v_k, attribute, value); {How many times does kv take v_k,
        and attribute takes value?}
    v_2 ← CountOccurrence(attribute, value); {How many times does the specified
        attribute take the specified value?}
    v_2 ← Coerce(v_2); {Arithmetic precision in double float.}
    return(v_1 / v_2);
cnuf



The best attribute to select next is the one with the lowest entropy. The
entropy of each attribute in the data table is computed by the function
Entropy, and the lowest one is selected. The total entropy of an attribute
with respect to a target attribute is a weighted sum of the conditioned entropies
given all possible values of the attribute. It represents the expected amount of
information left to be determined after the value for attribute has been specified.
This value is minimum for the best attribute to be tested next. The conditioned
entropy of the target attribute, given that attribute has a particular value, is
computed from the conditional probabilities (Prob). The function Prob computes
the conditional probability that the target attribute (i.e., kv) has value
$v_k$ (i.e., the k-th value of kv), given that a particular attribute has a value. This
is computed for a particular set of data by computing the frequency
with which the target attribute takes the value $v_k$ when the attribute has the
specified value, divided by the overall frequency with which the attribute has
value, where value is the j-th value of attribute. Note that our classification
algorithm is similar to the one used in ID3, with the following differences: (1) In ID3
any attribute can be the target attribute, whereas the target in our algorithm
is the key-value (kv); (2) In ID3 there should be tests in place for the
appropriateness of the input (e.g., few attributes or data), whereas in our case,
thanks to FormInputForLearning, this is not necessary; (3) Memory can
be saved by the following data structure of table: (a) allocate at most $\ell$ memory
places (string) for each element of the first column (of the table) except the first
one (i.e., table[1][$\ell$ + 1]); (b) represent the other elements, the majority of the
cases, just by a short integer. These considerations bring simplifications and
efficiency to our code compared to ID3.
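
The entropy computation described above can be checked against the figures of Table 1 with a short Python sketch. The function names mirror Prob, ConditionedEntropy and Entropy, but the tuple-based data layout (`ROWS`, `COLS`) and the convention that a zero probability contributes nothing to the sum are our own assumptions.

```python
from math import log2

# (profit, age, competition, type), transcribed from the left part of Table 1.
ROWS = [
    ("down", "old", "no", "CK"), ("down", "midlife", "yes", "CK"),
    ("up", "midlife", "no", "HB"), ("down", "old", "no", "HB"),
    ("up", "new", "no", "HB"), ("up", "new", "no", "CK"),
    ("up", "midlife", "no", "CK"), ("up", "new", "yes", "CK"),
    ("down", "midlife", "yes", "HB"), ("down", "old", "yes", "CK"),
]
COLS = {"profit": 0, "age": 1, "cp": 2, "type": 3}


def prob(rows, target, tval, attr, aval):
    """Relative frequency of target == tval among the rows where attr == aval."""
    matching = [r for r in rows if r[COLS[attr]] == aval]
    return sum(r[COLS[target]] == tval for r in matching) / len(matching)


def conditioned_entropy(rows, target, attr, aval):
    s = 0.0
    for tval in {r[COLS[target]] for r in rows}:
        p = prob(rows, target, tval, attr, aval)
        if p > 0:  # a zero probability contributes nothing
            s += p * log2(p)
    return -s


def entropy(rows, target, attr):
    """Weighted sum of the conditioned entropies over all values of attr."""
    total = 0.0
    for aval in {r[COLS[attr]] for r in rows}:
        weight = sum(r[COLS[attr]] == aval for r in rows) / len(rows)
        total += weight * conditioned_entropy(rows, target, attr, aval)
    return total


best = min(["age", "cp", "type"], key=lambda a: entropy(ROWS, "profit", a))
```

Under these assumptions the attribute with minimum entropy is age, in agreement with the root of the tree in Table 1.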
4 Experiments and Comparisons

Based on the main algorithm described in Section 3, we have implemented
dictionaries. Below, we illustrate the benefits of our method by reporting
the results for the following three inputs.
Fig. 1. An acyclic 2-subsequential transducer obtained by the method of
Mihov and Maurel [13, page 150]. A (14,16)-automaton, minimal except
for the word jul.

Fig. 2. Our solution - a (12,16) unlabeled automaton. The output function
(e.g., kv(feb) = {28/29}) is learned by questions posed about the last character
(b1) of the keys of the input language, e.g., If b1=’c/g/l/n’ Then kv=31.



Input 1: Let us consider the following example due to Mihov and Maurel [13,
page 150]: f = {(apr, 30), (aug, 31), (dec, 31), (feb, 28/29), (jan, 31), (jul, 31)}.

Figure 1 shows the associated transducer - a (14,16)-automaton which can be
obtained using their method. Figure 2 reports our solution: a (12,16) unlabeled
automaton. The output function is learned via b1 as the best attribute and the
root of the dt of three leaves: (1) If b1=’c/g/l/n’ Then kv=31; (2) If b1=’r’
Then kv=30; (3) If b1=’b’ Then kv=28/29.
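
The learning step for Input 1 can be reproduced with a compact ID3-like sketch in Python. It is not the authors' implementation: the names `backward_rows`, `entropy` and `classify`, the dictionary-based rows, and the output as a flat list of solution-paths are our own assumptions, and keys are assumed to be pairwise distinct.

```python
from collections import Counter
from math import log2

MONTHS = [("apr", "30"), ("aug", "31"), ("dec", "31"),
          ("feb", "28/29"), ("jan", "31"), ("jul", "31")]


def backward_rows(pairs):
    """Each row maps j to b_j (j-th character from the right, '*' if absent), plus kv."""
    longest = max(len(k) for k, _ in pairs)
    return [({j: (k[-j] if j <= len(k) else "*") for j in range(1, longest + 1)}, v)
            for k, v in pairs]


def entropy(rows, j):
    """Weighted conditioned entropy of kv given attribute b_j."""
    total = 0.0
    for c, n in Counter(r[j] for r, _ in rows).items():
        kvs = Counter(v for r, v in rows if r[j] == c)
        total += n / len(rows) * -sum(m / n * log2(m / n) for m in kvs.values())
    return total


def classify(rows, attrs, path=()):
    """Return solution-paths as (tests, kv), where tests is a tuple of (j, character)."""
    values = {v for _, v in rows}
    if len(values) == 1:
        return [(path, values.pop())]
    best = min(attrs, key=lambda j: entropy(rows, j))
    paths = []
    for c in sorted({r[best] for r, _ in rows}):
        sub = [(r, v) for r, v in rows if r[best] == c]
        paths += classify(sub, [j for j in attrs if j != best], path + ((best, c),))
    return paths


paths = classify(backward_rows(MONTHS), [1, 2, 3])
for tests, kv in paths:
    print(tests, "->", kv)
```

Every path produced here tests only b1; grouping the paths that share a key-value gives the three leaves listed above.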

Input 2: The benefits of our work are best illustrated by the case of
ambiguous finite-state transducers (AFST). An AFST returns for every accepted
input string one or more output strings by following different alternative paths
from the initial state to a final state [?]. In addition, there may be a number of
other paths that are followed from the initial state up to a certain point where
they fail. Following these latter paths is necessary but represents an inefficiency
(loss of time). Consider the input alphabet {a, b, c} along with the following
input and output languages: {cabba, cabca} and kv(cabca) = {yzxxy, yzyyy};
kv(cabba) = {xxxxx, xxyyx, xyzyx}, respectively. The output function can easily
be characterized using b2, as shown in Figure 4: a (7,7) automaton along with
two decision rules. Figure 3, due to Kempe [12, page 158], shows a (13,16)
automaton for this input.
Fig. 3. Example of an ambiguous finite-state transducer shown by a (13,16)
automaton [12, page 158].

If b2 = 'b' Then v1 = [xxxxx, xxyyx, xyzyx]
If b2 = 'c' Then v2 = [yzxxy, yzyyy]

Fig. 4. Our alternative - a (7,7) unlabeled automaton along with two
decision rules.



Input 3: French linguists divided the French infinitive verbs of the third group
(fv3 - the irregular ones) into 60 classes numbered from 23 to 82 [3]. Suppose we
are interested in determining the class-number of any verb of fv3. An obvious
solution is to use the longest common suffix of each class established by linguists.
For instance, one can say that if a verb (of fv3) ends in ’-etir’ then the key-value
is 26, and so on. But our classification beats such an ’onerous’ classification
by providing a cheap one: If the fourth character from right to left (b4) of a
recognized verb is ’e’ then the output value is 26. Another path-solution is (b4
e b6 q kv 24), meaning that if b4=’e’ and b6=’q’ then the output value is 24.
Our experiment on fv3 with 363 (p) verbs outputs a (310,652)-automaton
and a decision tree of 109 leaves. Part of the latter is shown below in terms of
the path-solutions, which are easier to visualize.



(b4 e kv 26) (b4 e b6 q kv 24) (b4 e b6 * kv 37) (b4 v b6 e b8 p kv 45) (b4 v b6 e b8 * 
kv 44) (b4 v b6 d kv 42) (b4 v b6 s kv 41) (b4 v b6 u kv 40) (b4 v b6 r kv 39) ... (b4
i b7 c kv 67) (b4 i b7 e b6 n kv 65) (b4 i b7 e b6 p kv 66) (b4 i b7 * b6 n kv 65) (b4 i 
b7 * b6 p kv 66) (b4 i b7 a kv 64) (b4 i b7 n kv 64) . . . (b4 u b6 - kv 82) (b4 u b6 t kv 
82) (b4 u b6 a kv 82) (b4 u b6 r kv 82) ... (b4 u b6 * kv 82) (b4 u b6 s kv 72) (b4 u 
b6 m b3 r kv 34) (b4 u b6 m b3 d kv 74) (b4 u b6 c b3 r kv 33) (b4 u b6 c b3 d kv 73) 



Since we have 60 classes, optimization of the dt is desired. This is
done in two phases. First, by removing the star-tests (e.g., b6=’*’) in the
subtrees and then shifting the reduced branches to the rightmost position of the
current subtree. Shifting is required because a careless removal loses the desired
output value. Second, by regrouping the identical output values having the most
common branches in the same subtree. We perform this phase when |V| < p. Consider
the following subtree: {(b4 i b7 * b6 n kv 65), (b4 i b7 * b6 p kv 66), (b4 i
b7 a kv 64), (b4 i b7 n kv 64)}. After removing two star-tests (and hence two
transfers), the reduced form is: {(b4 i b7 a kv 64), (b4 i b7 n kv 64), (b4 i b6 n kv
65), (b4 i b6 p kv 66)}. By applying the second phase we obtain: {(b4 i b7 a/n
kv 64), (b4 i b6 n kv 65), (b4 i b6 p kv 66)}. We omit the optimization algorithm
due to space limitations. The integration of these two phases into our present
implementation, as well as the implementation of all alternative techniques
mentioned in this paper for a fair comparison (i.e., same data, same machine,
same programming language) with our work, is desired.
Abstract. An efficient simulation algorithm using an algebra of transients for gate
circuits was proposed by Brzozowski and Ésik. This algorithm seems capable of
predicting all the signal changes that can occur in a circuit under worst-case delay 
conditions. We verify this claim by comparing simulation with binary analysis. 
For any feedback- free circuit consisting of 1- and 2-input gates and started in a 
stable state, we prove that all signal changes predicted by simulation occur in 
binary analysis, provided that wire delays are taken into account. Two types of 
finite automata play an important role in our proof. 



1 Introduction 

Detecting signal changes in digital circuits is important, because unwanted (hazardous) 
signal changes may affect the correctness of computations, and increase the computation 
time and energy consumption. To address this problem, Brzozowski and Ésik [1] proposed
an infinite-valued algebra C of transients, and an efficient simulation algorithm for
gate circuits based on this algebra. In a companion paper [?] we compare the simulation
of a circuit in C to the traditional binary analysis. We show that simulation of an arbitrary 
circuit is sufficient: all the changes that occur in binary analysis are also predicted by 
the simulation. In general, however, simulation is more pessimistic than binary analysis. 
It is the purpose of this paper to determine how pessimistic the simulation can be. 

Here we consider the class of feedback-free gate circuits with stable initial states. 
Although this is a special case, it is important in practice. For this case, we show that 
all the changes predicted by simulation also occur in binary analysis, provided that wire 
delays are taken into account. Our result is limited to gate circuits constructed with 1- 
or 2-input gates; the general case remains open. 

This paper is as self-contained as possible. The reader should see [2] for more details,
and [?] for complete background information and proofs.

2 Circuits, Networks, and Binary Analysis 

For an integer $n > 0$, $[n]$ denotes $\{1, \ldots, n\}$. Boolean operations OR, NOT, and XOR
are denoted $\lor$, an overbar (as in $\overline{x}$), and $\not\equiv$, respectively.

Figure 1(a) shows a gate circuit consisting of an inverter and an OR gate. It has
input variable $X_1$ and state variables $s_a$ and $s_b$. Each state variable $s_i$ has an excitation
$S_i$, which is the Boolean function of the corresponding gate. Here, $S_a = \overline{X_1}$, and
$S_b = X_1 \lor s_a$. The value of a variable may be different from that of its excitation. This
allows us to represent the delay of a gate. A variable normally follows its excitation.
However, if the excitation changes quickly, the variable may fail to follow the excitation,
because of the inertial nature of the delay.

Fig. 1. Circuit C1 and its complete version

To account for other delays, we construct the complete counterpart of a circuit by
adding the following variables. For each input $X_i$, we add an input-gate variable; we
represent the input gate by a triangle. We also consider each fork as a fork gate¹ and add
a variable for each fork output; we represent a fork gate by a rectangle. Finally, we add
a variable for each wire. The excitations of the added variables are identity functions.
For our example, see Fig. 1(b). We add input-gate variable $s_1$, fork-gate variables $s_3$ and
$s_4$, wire variables $s_2$, $s_5$, $s_7$, and $s_8$, and we relabel $s_a$ as $s_6$ and $s_b$ as $s_9$. The new
excitations are: $S_1 = X_1$, $S_2 = s_1$, $S_3 = S_4 = s_2$, $S_5 = s_3$, $S_6 = \overline{s_5}$, $S_7 = s_6$,
$S_8 = s_4$, $S_9 = s_7 \lor s_8$.

Any circuit, complete or not complete, is modeled by a network. 

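Definition 1. A network is a tuple $N = (\mathcal{D}, \mathcal{X}, \mathcal{S}, \mathcal{E})$, where $\mathcal{D}$ is the domain of
values, $\mathcal{X} = \{X_1, \ldots, X_n\}$, the set of inputs, $\mathcal{S} = \{s_1, \ldots, s_m\}$, the set of state
variables with associated excitations $S_1, \ldots, S_m$, and $\mathcal{E} \subseteq (\mathcal{X} \times \mathcal{S}) \cup (\mathcal{S} \times \mathcal{S})$, a set
of directed edges. There is an edge between $x$ and $y$ iff the excitation of $y$ depends on $x$.
The network graph is the digraph $(\mathcal{X} \cup \mathcal{S}, \mathcal{E})$.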

A state of $N$ is an $m$-tuple $b$ of values from $\mathcal{D}$ assigned to state variables $s_1, \ldots, s_m$.
A total state is an $(n + m)$-tuple $c = a \cdot b$ of values from $\mathcal{D}$, the $n$-tuple $a$ being the
inputs, and the $m$-tuple $b$, the state. The value of $S_i$ in $a \cdot b$ is denoted $S_i(a \cdot b)$. The tuple
of all $S_i(a \cdot b)$, for $i \in [m]$, is denoted $S(a \cdot b)$. For any $a \cdot b$, the set of unstable state
variables is $U(a \cdot b) = \{s_i \mid b_i \neq S_i(a \cdot b)\}$. Thus, $a \cdot b$ is stable iff $U(a \cdot b) = \emptyset$. For
any state variable $s_i \in \mathcal{S}$, we define its fan-in set $\phi(s_i) = \{x \mid x \in \mathcal{X} \cup \mathcal{S}, (x, s_i) \in \mathcal{E}\}$.
We call state variables $s_i$ and $s_j$ related if $s_i \mathcal{E}^+ s_j$ or $s_j \mathcal{E}^+ s_i$, where $\mathcal{E}^+$ is the transitive
closure of $\mathcal{E}$; otherwise, we call them unrelated.

For binary analysis we use the binary domain, $\mathcal{D} = \{0, 1\}$. We describe the behavior
of a network started in a given state with the input kept constant at value $a \in \{0, 1\}^n$
by defining a binary relation $R_a$ on the set $\{0, 1\}^m$ of states of $N$. For any $b \in \{0, 1\}^m$,
$b R_a b$, if $U(a \cdot b) = \emptyset$, and $b R_a b^K$, if $U(a \cdot b) \neq \emptyset$ and $K$ is any nonempty subset of
$U(a \cdot b)$, where by $b^K$ we mean $b$ with the variables in $K$ complemented. No other pairs
are related by $R_a$. As usual, we associate a digraph with the $R_a$ relation, and denote it $G_a$.

¹ Our reason for calling input delays and forks gates will become clear later.
Fig. 2. Circuit C2 and its binary analysis



For given $a \in \{0, 1\}^n$ and $b \in \{0, 1\}^m$ we define the set of all states reachable from
$b$ in relation $R_a$ as $reach(R_a(b)) = \{c \mid b R_a^* c\}$, where $R_a^*$ is the reflexive and transitive
closure of $R_a$. We denote by $G_a(b)$ the subgraph of $G_a$ corresponding to $reach(R_a(b))$.

Example 1. For the complete circuit in Figure 2 with excitations $S_1 = X_1$, $S_2 = X_2$,
$S_3 = s_1$, $S_4 = s_2$, $S_5 = s_3 \lor s_4$, we show $G_{01}(10101)$, where tuples are shown
as words, unstable variables are underlined, and boldface features and edge labels are
for later use.
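
The set $reach(R_a(b))$ of Example 1 can be enumerated with a small Python sketch. The excitations are those listed in the example (with $S_5$ read as the OR of $s_3$ and $s_4$); the function names `excitations`, `successors` and `reach` and the tuple encoding of states are our own assumptions.

```python
from itertools import combinations


def excitations(a, b):
    """Excitations of circuit C2: S1 = X1, S2 = X2, S3 = s1, S4 = s2, S5 = s3 OR s4."""
    x1, x2 = a
    s1, s2, s3, s4, s5 = b
    return (x1, x2, s1, s2, s3 | s4)


def successors(a, b):
    """States related to b by R_a: complement any nonempty subset of unstable variables."""
    unstable = [i for i, (v, e) in enumerate(zip(b, excitations(a, b))) if v != e]
    if not unstable:
        return {b}  # a stable state is related only to itself
    out = set()
    for r in range(1, len(unstable) + 1):
        for subset in combinations(unstable, r):
            out.add(tuple(1 - v if i in subset else v for i, v in enumerate(b)))
    return out


def reach(a, b):
    """All states reachable from b under the reflexive and transitive closure of R_a."""
    seen, stack = {b}, [b]
    while stack:
        current = stack.pop()
        for nxt in successors(a, current):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


states = reach((0, 1), (1, 0, 1, 0, 1))
stable = {s for s in states if successors((0, 1), s) == {s}}
```

In this sketch the only stable reachable state is 01011, while the state 00000 is also reachable, so the output $s_5$ may change from 1 to 0 and back to 1 under some delay distributions.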



3 Transients, Gate Automata, and Extended Functions 

A transient is a nonempty word over $\{0, 1\}$ in which no two consecutive symbols
are the same. Thus the set of all transients is

$$T = 0(10)^* \cup 1(01)^* \cup 0(10)^*1 \cup 1(01)^*0.$$

Transients represent changing signals in a natural way; for instance, transient 010
represents a signal changing from 0 to 1 to 0. For any $\mathbf{t} \in T$ we denote by $\alpha(\mathbf{t})$ and $\omega(\mathbf{t})$ its
first and last characters, respectively, and by $|\mathbf{t}|$, its length. A transient can be obtained
from any nonempty binary word by contraction, i.e., the elimination of duplicates
immediately following a symbol (e.g., the contraction of 00100011 is 0101). For a binary word
$s$ we denote its contraction by $\hat{s}$. For any $\mathbf{t}, \mathbf{t}' \in T$, we denote by $\mathbf{t} \circ \mathbf{t}'$ concatenation
followed by contraction, i.e., $\mathbf{t} \circ \mathbf{t}' = \widehat{\mathbf{t}\mathbf{t}'}$.
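
Contraction and the operation $\circ$ are easy to state in code. The following Python sketch represents binary words as strings of '0' and '1'; the function names `contract`, `compose` and `is_transient` are ours.

```python
def contract(word):
    """Remove every symbol that repeats its immediate predecessor."""
    out = []
    for c in word:
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def compose(t, u):
    """Concatenation followed by contraction."""
    return contract(t + u)


def is_transient(word):
    """A transient is a nonempty binary word with no two equal consecutive symbols."""
    return word != "" and set(word) <= {"0", "1"} and contract(word) == word


print(contract("00100011"))  # the example from the text
print(compose("010", "01"))
```

With this encoding, the contraction of 00100011 is 0101, as stated in the text.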

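Extensions of Boolean functions to the domain $T$ of transients were defined in [1].
Here we give an equivalent definition using finite automata, needed for later proofs. For
common Boolean functions, [1] gives simpler formulas for computing extensions. For
example, let $\oplus$ be the extension of the OR function. Then, for transients $\mathbf{w}, \mathbf{w}'$ of length
$> 1$, $\mathbf{w} \oplus \mathbf{w}' = \mathbf{t}$, where $\mathbf{t}$ is such that

$$\alpha(\mathbf{t}) = \alpha(\mathbf{w}) \lor \alpha(\mathbf{w}'), \quad \omega(\mathbf{t}) = \omega(\mathbf{w}) \lor \omega(\mathbf{w}'), \quad z(\mathbf{t}) = z(\mathbf{w}) + z(\mathbf{w}') - 1,$$

where $z(\mathbf{t})$ is the number of 0s in $\mathbf{t}$. Also,

$$\mathbf{t} \oplus 0 = 0 \oplus \mathbf{t} = \mathbf{t}, \quad \text{and} \quad \mathbf{t} \oplus 1 = 1 \oplus \mathbf{t} = 1.$$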
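
The formula for the extension of OR can be turned into a short Python sketch. Transients are strings of '0' and '1'; the function name `or_ext` and the way the result is rebuilt from its first symbol, last symbol and number of 0s are our own choices, not taken from [1].

```python
def or_ext(w, v):
    """Extension of OR to transients, following the formulas for first, last and zero count."""
    if w == "0":
        return v
    if v == "0":
        return w
    if w == "1" or v == "1":
        return "1"
    first = int(w[0]) | int(v[0])
    last = int(w[-1]) | int(v[-1])
    zeros = w.count("0") + v.count("0") - 1
    # An alternating word with `zeros` 0s has 2*zeros - 1 symbols if it starts and
    # ends with 0, and one more symbol for each end that is a 1.
    length = 2 * zeros + first + last - 1
    return "".join(str((first + i) % 2) for i in range(length))


print(or_ext("01", "010"))
print(or_ext("10", "01"))
```

A transient is determined by its first symbol and its length, which is why the three quantities in the formula suffice to rebuild the result.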
Fig. 3. OR gate automaton



Let $s_i$ be the state variable of a gate, let $\phi(s_i) = \{s_1, \ldots, s_k\}$, and let the excitation
$S_i$ be Boolean function $f : \{0, 1\}^k \to \{0, 1\}$. We extend $f$ to $\mathbf{f} : T^k \to T$. For
any tuple $(\mathbf{t}_1, \ldots, \mathbf{t}_k)$ of transients, $\mathbf{f}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$ is the longest transient the gate could
produce at its output, if its inputs changed as shown by the $k$ transients. Formally, the
definition uses a finite automaton to model the gate behavior. For any $j \in [k]$, we denote
by $e_j$ the $j$-th unit tuple in $\{0, 1\}^k$ that has a 1 in position $j$ and 0s elsewhere. By taking
the component-wise exclusive OR of a tuple $t \in \{0, 1\}^k$ with $e_j$, we obtain the tuple
$t' \in \{0, 1\}^k$ that differs from $t$ only in the $j$th component.

Definition 2. The gate automaton of a gate $s_i$ with Boolean function $f : \{0, 1\}^k \to
\{0, 1\}$ is an (uninitialized) Moore machine $G_i = (I_i, O, V_i, \tau_i, o_i)$, where $I_i = \phi(s_i)$
is the input alphabet, $O = \{0, 1\}$, the output alphabet, $V_i = \{0, 1\}^k$, the set of states,
$\tau_i : V_i \times I_i \to V_i$, the transition function defined for $p \in V_i$, $s_j \in I_i$ as $\tau_i(p, s_j) = p \not\equiv e_j$,
and $o_i : V_i \to O$, the output function defined for $p \in V_i$ as $o_i(p) = f(p)$. If $G_i$ has an
initial state $\hat{p}$, then it is denoted by $(G_i, \hat{p})$.



Example 2. The automaton of an OR gate with inputs $s_1$ and $s_2$ is shown in Fig. 3.

We extend $\tau_i$ to words as follows: for the empty word $\epsilon$, $\tau_i(p, \epsilon) = p$, and for any
$p \in V_i$ and $w \in I_i^*$, $\tau_i(p, w s_j) = \tau_i(p, w) \not\equiv e_j$. Any state of $G_i$ can be the initial
state, depending on the tuple of transients for which we compute the extension. Suppose
the initial state is $\hat{p}$. In $(G_i, \hat{p})$, any input word $u \in I_i^*$ produces an output word $v$
as follows. If $u$ is the empty word, then $v = f(\hat{p})$. Otherwise, $u = u' s_j$ produces
$v = w f(\tau_i(\hat{p}, u))$, where $w$ is the output word of $u'$. The contraction $\hat{v}$ of $v$ is the
output profile of $u$. For any tuple $(\mathbf{t}_1, \ldots, \mathbf{t}_k)$ of transients, describing the changes of
variables $s_1, \ldots, s_k$, we choose $\hat{p} = (\alpha(\mathbf{t}_1), \ldots, \alpha(\mathbf{t}_k))$. Thus $\hat{p}$ shows the initial
values of $s_1, \ldots, s_k$. An input word $u \in I_i^*$ determines the unique tuple $(\mathbf{t}_1, \ldots, \mathbf{t}_k)$
of transients if $|u|_{s_j} = |\mathbf{t}_j| - 1$, for all $j \in [k]$, where $|u|_{s_j}$ is the number of times $s_j$
occurs in $u$. There may be several input words determining a given tuple of transients;
let $\mathcal{U}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$ be the set of all words determining $(\mathbf{t}_1, \ldots, \mathbf{t}_k)$. Let $\mathcal{V}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$
be the set of output profiles of the words in $\mathcal{U}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$, and let $v_{max}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$ be
the longest profile in $\mathcal{V}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$.

Definition 3. For any Boolean function $f : \{0, 1\}^k \to \{0, 1\}$, we define its extension
$\mathbf{f} : T^k \to T$ by $\mathbf{f}(\mathbf{t}_1, \ldots, \mathbf{t}_k) = v_{max}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$, for all $(\mathbf{t}_1, \ldots, \mathbf{t}_k) \in T^k$.

Definition 4. For gate automaton $(G_i, \hat{p})$ an input word $u \in I_i^*$ is called worst-case
if $u$ determines $(\mathbf{t}_1, \ldots, \mathbf{t}_k)$, and has the longest output profile among the words in
$\mathcal{U}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$, i.e., if $\hat{v} = v_{max}(\mathbf{t}_1, \ldots, \mathbf{t}_k)$, where $v$ is the output word of $u$.

Example 3. We illustrate the new concepts of this section. Consider the automaton of
Fig. 3 with $\hat{p} = 00$. The output produced by $u = s_1 s_2 s_1$ is $v = 0111$, and its output
profile is $\hat{v} = 01$. If $u = s_1 s_1 s_2 s_2$, then $v = 01010$, and $\hat{v} = 01010$. Also

$\mathcal{U}(010, 010) = \{s_1 s_2 s_1 s_2, s_1 s_2 s_2 s_1, s_1 s_1 s_2 s_2, s_2 s_1 s_1 s_2, s_2 s_1 s_2 s_1, s_2 s_2 s_1 s_1\}$.

We compute $01 \oplus 010$. We find $\mathcal{U}(01, 010) = \{s_1 s_2 s_2, s_2 s_1 s_2, s_2 s_2 s_1\}$ and $\mathcal{V} =
\{01, 0101\}$. Hence $01 \oplus 010 = 0101$. Word $s_2 s_2 s_1$ is worst-case, in contrast to $s_1 s_2 s_2$
or $s_2 s_1 s_2$.
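
Example 3 can be replayed by brute force. The Python sketch below enumerates all input words determining a tuple of transients, runs them through the gate automaton, and collects the output profiles; the function name `extension`, the use of permutations to enumerate input words, and the 0-based indexing of inputs are our own assumptions. The enumeration is exponential and only meant for small examples.

```python
from itertools import permutations


def contract(word):
    out = []
    for c in word:
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def extension(f, transients):
    """Return (longest output profile, set of all output profiles) for a Boolean gate f."""
    start = tuple(int(t[0]) for t in transients)  # initial state: first symbols
    # Input j must occur |t_j| - 1 times in every word determining the tuple.
    letters = [j for j, t in enumerate(transients) for _ in range(len(t) - 1)]
    profiles = set()
    for word in set(permutations(letters)):
        state = list(start)
        output = [f(*state)]
        for j in word:
            state[j] ^= 1  # component-wise exclusive OR with the j-th unit tuple
            output.append(f(*state))
        profiles.add(contract("".join(map(str, output))))
    return max(profiles, key=len), profiles


def or_gate(a, b):
    return a | b


longest, profiles = extension(or_gate, ("01", "010"))
print(longest, sorted(profiles))
```

All profiles start with the same symbol, so the longest one is unique. For the transients 01 and 010 the sketch yields the set {01, 0101} and the extension 0101, as in the example.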



4 Simulation 

To simulate a circuit having binary network $N = (\{0, 1\}, \mathcal{X}, \mathcal{S}, \mathcal{E})$, we use the transient
network $\mathbf{N} = (T, \mathcal{X}, \mathcal{S}, \mathcal{E})$. The inputs and state variables of $N$ and $\mathbf{N}$ are the same, but
in $\mathbf{N}$ they take values in the domain $T$, and excitations in $\mathbf{N}$ are the extensions to $T$ of
the Boolean excitations in $N$. Binary variables, words, tuples and excitations in $N$ are
denoted by italic characters (e.g., $s$, $S$). Transients, tuples of transients, and excitations
in $\mathbf{N}$ are denoted by boldface characters (e.g., $\mathbf{s}$, $\mathbf{S}$).

The simulation consists of Algorithm A [1], given below. We want to know what
happens when the network starts in a stable binary initial state $\hat{a} \cdot b$, and the input is
changed to $a$. We set the input of network $\mathbf{N}$ to $\hat{a} \circ a$, where $\circ$ is applied componentwise.
We then change all variable values to the values of their excitations. For feedback-free
circuits A always terminates. Let the sequence of states produced by A be $\mathbf{s}^0, \ldots, \mathbf{s}^H$.
This sequence is nondecreasing in the prefix order on $T$; we say that A is monotonic.

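Algorithm A

    a := â ∘ a;
    s^0 := b;
    h := 1;
    s^h := S(a · s^0);
    while (s^h <> s^(h-1)) do
        h := h + 1;
        s^h := S(a · s^(h-1));

Example 4. For the circuit of Fig. 1(b), the extended excitations are: $\mathbf{S}_1 = X_1$, $\mathbf{S}_2 =
\mathbf{s}_1$, $\mathbf{S}_3 = \mathbf{S}_4 = \mathbf{s}_2$, $\mathbf{S}_5 = \mathbf{s}_3$, $\mathbf{S}_6 = \overline{\mathbf{s}_5}$, $\mathbf{S}_7 = \mathbf{s}_6$, $\mathbf{S}_8 = \mathbf{s}_4$, $\mathbf{S}_9 = \mathbf{s}_7 \oplus \mathbf{s}_8$. Input $X_1$
changes from $\hat{a} = 0$ to $a = 1$, and $b = 000001101$. The result is in the table above.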
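
Algorithm A can be traced on the circuit of Example 4 with the following Python sketch. It is a hedged illustration only: the excitations are those listed in the example (with $\mathbf{S}_6$ read as the complement of $\mathbf{s}_5$), the extension of NOT is taken to be symbol-wise complementation, the extension of OR uses the closed formula of Section 3, and all function names are ours. The printed sequence is produced by this sketch and is not copied from the paper's table.

```python
def contract(word):
    out = []
    for c in word:
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def not_ext(t):
    """Assumed extension of NOT: complement every symbol of the transient."""
    return "".join("1" if c == "0" else "0" for c in t)


def or_ext(w, v):
    """Extension of OR via the formulas for first symbol, last symbol and number of 0s."""
    if w == "0":
        return v
    if v == "0":
        return w
    if w == "1" or v == "1":
        return "1"
    first, last = int(w[0]) | int(v[0]), int(w[-1]) | int(v[-1])
    zeros = w.count("0") + v.count("0") - 1
    return "".join(str((first + i) % 2) for i in range(2 * zeros + first + last - 1))


def excite(x1, s):
    """Extended excitations of the complete circuit; s[0] holds s_1, ..., s[8] holds s_9."""
    return [x1, s[0], s[1], s[1], s[2], not_ext(s[4]), s[5], s[3], or_ext(s[6], s[7])]


def algorithm_a(old_input, new_input, b):
    """Iterate s := S(a . s) until nothing changes; return the whole sequence of states."""
    x1 = contract(old_input + new_input)
    states = [list(b)]
    while True:
        nxt = excite(x1, states[-1])
        if nxt == states[-1]:
            return states
        states.append(nxt)


trace = algorithm_a("0", "1", "000001101")
for state in trace:
    print(" ".join(state))
```

In this run the output variable $s_9$ ends with the transient 101, which indicates a possible 1-0-1 pulse on the output when the input rises.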

Theorem 1. Let $\tilde{\mathbf{N}} = (T, \mathcal{X}, \tilde{\mathcal{S}}, \tilde{\mathcal{E}})$ be the complete version of a feedback-free network
$\mathbf{N} = (T, \mathcal{X}, \mathcal{S}, \mathcal{E})$. Let $\tilde{\mathbf{s}}^{\tilde{H}}$ be the result of Algorithm A for $\tilde{\mathbf{N}}$ started in stable binary
total state $\hat{a} \cdot \tilde{b}$, with input $\hat{a} \circ a$. Let $\mathbf{s}^H$ be the result of Algorithm A for $\mathbf{N}$, started in
stable binary total state $\hat{a} \cdot b$, where $\tilde{b}_i = b_i$, for all $s_i \in \mathcal{S}$, with the same input. Then
$\tilde{\mathbf{s}}^{\tilde{H}}$ and $\mathbf{s}^H$ agree on the variables in $\mathcal{S}$.

The proof is given in [?]. The theorem shows that adding wire delays to $\mathbf{N}$ does not
affect the number of signal changes in $\mathbf{N}$. In other words, simulation takes wire delays
into account automatically.



5 Covering of Simulation by Binary Analysis 

Let $N$ be the binary network of a complete feedback-free circuit. We start $N$ in stable
total state $\hat{a} \cdot b$, and change the input tuple $\hat{a}$ to $a$. In the resulting state $a \cdot b$ only the
input gates corresponding to the inputs that change are unstable, all other state variables
being stable. Let $G_a(b)$ be the result of the binary analysis of $N$ with initial state $a \cdot b$.
Here, by a path in $G_a(b)$ we always mean a path starting in state $b$.

Definition 5. Let $\pi = s^0, \ldots, s^h$ be any path in $G_a(b)$. Recall that each $s^j$ is a tuple
$(s_1^j, \ldots, s_m^j)$. For any $i \in [m]$, we denote by $\sigma_i^\pi$ the transient $\widehat{s_i^0 \ldots s_i^h}$ showing the
changes of $s_i$ along $\pi$. We call $\sigma_i^\pi$ the history of $s_i$. We also define $\Sigma_i^\pi$ to be $\hat{E}_i$, where
$E_i = S_i(a \cdot s^0) \ldots S_i(a \cdot s^h)$ is the word showing the changes of $S_i$ along $\pi$. We call
$\Sigma_i^\pi$ the excitation history of $s_i$. We denote by $\sigma^\pi$ the tuple $(\sigma_1^\pi, \ldots, \sigma_m^\pi)$.

Let $\mathbf{N}$ be the transient counterpart of $N$, and let the result of Algorithm A for $\mathbf{N}$, with
initial state $\hat{a} \cdot b$, and input $\hat{a} \circ a$ be $\mathbf{s}^0, \ldots, \mathbf{s}^H$. If we find a path $\pi$ in $G_a(b)$ whose history
matches the last state of the simulation, then $\pi$ covers all simulation states, due to the
monotonicity of Algorithm A. Thus, we are looking for a matching path, defined next.

Definition 6. Let $\pi$ be a path in $G_a(b)$. Let $V \subseteq \mathcal{S}$ be a set of state variables in $N$. Path
$\pi$ is matching on $V$ if $\sigma_i^\pi = \mathbf{s}_i^H$, for all $s_i \in V$.

Our main result is stated next. The rest of the paper is devoted to its proof. 

Theorem 2. Binary analysis covers simulation in the following sense. There exists a
path $\pi$ in $G_a(b)$ that is matching on $\mathcal{S}$, i.e., that satisfies $\sigma^\pi = \mathbf{s}^H$.

Let $G_a^1(b)$ be the subgraph of $G_a(b)$ in which exactly one unstable variable changes
at each step. For example, $G_a^1(b)$ is shown in boldface in Fig. 2. The next proposition [5],
stated without proof, allows us to restrict ourselves to $G_a^1(b)$. Paths $\pi$ and $\pi'$ are equivalent
if they have the same history, i.e., if $\sigma^\pi = \sigma^{\pi'}$.

Proposition 1. For any path $\pi$ in $G_a(b)$ there exists an equivalent path $\pi'$ in $G_a^1(b)$.

Since the circuit is feedback-free, we can arrange the state variables of N in levels
as follows: level 0 consists of the input gates; level 1 consists of all variables whose
fan-in set belongs to level 0; and, in general, level $l$ consists of all variables whose fan-in
variables belong to levels $< l$, and which have at least one fan-in variable in level $l - 1$.
This level assignment results in even levels containing gate variables and odd levels
containing wire variables. (Footnote: For this reason we consider input delays and forks
as gates.) We use $level(s_i)$ to denote the level of $s_i$. The last level
of any network is always a gate level, so N has an even number 2L of levels. Let $2l$,
where $0 < l \le L$, be a gate level of the circuit. Let $V_{2l} = \{s_i \in S \mid level(s_i) = 2l\}$.
Let $V = \bigcup_{s_i \in V_{2l}} \phi(s_i)$. Suppose $V = \{s_1, \ldots, s_k\}$. Note that $s_1, \ldots, s_k$ are all wire
variables and are initially stable; they are not necessarily all of level $2l - 1$, but they belong
to levels $< 2l$. The circuit is partitioned by these variables into two areas, denoted $\mathcal{L}_{2l}$ and
$\mathcal{R}_{2l}$. Area $\mathcal{L}_{2l}$ contains the gates of level 0 and those of levels $< 2l$ together with their
fan-in variables. Area $\mathcal{R}_{2l}$ contains the gates of level $2l$ and those of levels $> 2l$ together with
their fan-in variables. Since the circuit is feedback-free, there are no signals flowing from
$\mathcal{R}_{2l}$ to $\mathcal{L}_{2l}$, but there may be wires that connect outputs of gates in $\mathcal{L}_{2l}$ to inputs of gates
of levels $> 2l$ in $\mathcal{R}_{2l}$. Formally,

$$\mathcal{L}_{2l} = \bigcup_{level(s_i) = 0} \{s_i\} \;\cup \bigcup_{\substack{0 < level(s_i) < 2l \\ level(s_i)\ \mathrm{even}}} \bigl(\{s_i\} \cup \phi(s_i)\bigr)$$

and

$$\mathcal{R}_{2l} = \bigcup_{level(s_i) = 2l} \{s_i\} \;\cup \bigcup_{\substack{level(s_i) > 2l \\ level(s_i)\ \mathrm{even}}} \bigl(\{s_i\} \cup \phi(s_i)\bigr).$$

Note that $\mathcal{L}_{2l}$, $V$, $\mathcal{R}_{2l}$ form a partition of the set S of state variables in N,
$\mathcal{L}_2 = \bigcup_{level(s_i) = 0} \{s_i\}$, $\mathcal{R}_{2L+2} = \emptyset$ and
$\mathcal{L}_{2L+2} = S$, if we assume a fictitious gate level $2L + 2$.

Fig. 4. Sample circuit with partition.
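The level assignment described above amounts to "one more than the highest fan-in level". The following is a minimal sketch under stated assumptions: the circuit is given as a hypothetical dictionary `fan_in` mapping each state variable to its fan-in variables (input gates map to an empty list), and the small sample circuit is invented for illustration only; it is not the circuit of Fig. 4.

```python
from functools import lru_cache

def assign_levels(fan_in):
    """Level 0 for input gates, otherwise 1 + the maximum level of the fan-in variables."""
    @lru_cache(maxsize=None)
    def level(v):
        preds = fan_in[v]
        return 0 if not preds else 1 + max(level(u) for u in preds)
    return {v: level(v) for v in fan_in}

def gate_level_and_fan_in(fan_in, levels, l):
    """Return (V_2l, V): the gates of level 2l and the union of their fan-in sets."""
    gates = {v for v, lev in levels.items() if lev == 2 * l}
    wires = {u for g in gates for u in fan_in[g]}
    return gates, wires

# Hypothetical circuit: g* are gates, w* are wires.
fan_in = {
    "g1": [], "g2": [],
    "w1": ["g1"], "w2": ["g2"], "w3": ["g1"],
    "g3": ["w1", "w2"],
    "w4": ["g3"],
    "g4": ["w3", "w4"],
}
levels = assign_levels(fan_in)
gates, wires = gate_level_and_fan_in(fan_in, levels, 2)
print(levels, gates, wires)
```

In this sample, wire `w3` feeds a gate of level 4 but is itself of level 1, which matches the remark that the variables in V need not all be of level $2l - 1$.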

Example 5. We illustrate the partition in Fig. 4. Here $l = 3$. We relabel the state variables
in $\mathcal{L}_{2l}$ with subscripted $\lambda$s, and those in $\mathcal{R}_{2l}$ with subscripted $\rho$s.



5.1 Proof of Theorem 2

The proof of Theorem 2 is by induction on $l$, where $1 \le l \le L$. We give only a sketch of
the proof here (see Q for details). We show there exists a path in $G_a^1(b)$ that is matching
on $\mathcal{L}_{2L+2} = S$. The basis consists of showing there exists a path that is matching on $\mathcal{L}_2$,
i.e., on the input-gate variables. In the induction step we assume we have a path $\tau$ that is
matching on $\mathcal{L}_{2l}$, and construct a path $\pi$ that is matching on $\mathcal{L}_{2l+2}$. A preliminary result
characterizes $\pi$ in terms of the hazard-preserving and worst-case properties defined next.

Definition 7. Let $\pi$ be a path in $G_a(b)$, and $s_i$ a state variable that is initially stable.
We call $\pi$ hazard-preserving on $s_i$ if $\sigma_i^\pi = \Sigma_i^\pi$. For $V \subseteq S$, path $\pi$ is hazard-preserving
on V if it is hazard-preserving on all $s_i \in V$.







Fig. 5. Delay automata: (a) for variable $s_i$, and (b) for set $\{s_1, s_2\}$



Definition 8. Let $\pi$ be a path in $G_a(b)$, and $s_i$ a gate variable implementing Boolean
function $f : \{0, 1\}^k \to \{0, 1\}$ that depends only on state variables. Let $\phi(s_i) =
\{s_1, \ldots, s_k\}$. We call $\pi$ worst-case on $s_i$ if $\Sigma_i^\pi = \mathbf{f}(\sigma_1^\pi, \ldots, \sigma_k^\pi)$, where $\mathbf{f}$ is the
extension of $f$ to transients. For a set $V \subseteq S$ of gate
variables, path $\pi$ is worst-case on V if it is worst-case on each $s_i \in V$.



Example 6. Consider the graph in Fig.E] Path $\pi$ = 10101, 00101, 00001, 01001, 01011
is hazard-preserving on $s_3$ and $s_4$, but not on $s_5$, since $\sigma_5^\pi = 1$ and $\Sigma_5^\pi = 101$. Path $\pi$
is worst-case on $s_5$, but $\pi'$ = 10101, 11101, 11111, 01111, 01011 is not, since $\Sigma_5^{\pi'} = 1$
and $10 \oplus 01 = 101$.



Lemma 1. Path $\pi$ in $G_a^1(b)$ is matching on $\mathcal{L}_{2l+2}$ iff it is matching on $\mathcal{L}_{2l}$, hazard-
preserving on $V$ and $V_{2l}$, and worst-case on $V_{2l}$.



5.2 Hazard-Preserving Paths 

We now characterize hazard-preserving paths by automata. For $V = \{s_1, \ldots, s_k\} \subseteq S$,
let $S(V) = \{S_i \mid s_i \in V\}$, and $A_V = V \cup S(V)$. Suppose the variables in V are
initially stable in $G_a^1(b)$, unrelated to each other, and have pairwise distinct excitations.
Recall that every state variable represents a delay. We want to describe the hazard-
preserving behavior of these delays in $G_a^1(b)$, i.e., we are interested in paths on which
the k delays do not 'lose' any changes. We describe the hazard-preserving behavior
of $s_i \in V$ by the automaton shown in Fig. 5(a); this is $\mathcal{D}_i$, the delay automaton for
variable $s_i$. The label on a transition of the automaton shows the excitation or variable
that changes in that transition. Subscript j ranges over $[k] \setminus \{i\}$. The label of each state
shows whether $s_i$ is stable (label is $\emptyset$) or unstable (label is $\{s_i\}$) in that state. Changes
of excitations or variables other than $S_i$ and $s_i$ do not alter the state, since variables in V
are unrelated and have distinct excitations. Variable $s_i$ changes each time it is unstable,
so as not to lose any changes.

To describe the hazard-preserving behavior of all $s_i \in V$ at the same time we take
the direct product [4] $\mathcal{D}_V$ of $\mathcal{D}_1, \ldots, \mathcal{D}_k$.

Example 7. The delay automaton of set $V = \{s_1, s_2\}$ is shown in Fig. 5(b). The
nonempty components of a state label show the variables that are unstable.

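Let $\mathcal{L}(\mathcal{D}_V)$ be the language accepted by $\mathcal{D}_V$. We call the words in $\mathcal{L}(\mathcal{D}_V)$ balanced
on V. By the definition of the direct product, we have $\mathcal{L}(\mathcal{D}_V) = \mathcal{L}(\mathcal{D}_1) \cap \ldots \cap \mathcal{L}(\mathcal{D}_k)$.
From the definition of each $\mathcal{D}_i$ it follows that $w \in A_V^*$ belongs to $\mathcal{L}(\mathcal{D}_i)$ iff
$w{\downarrow}_{\{S_i, s_i\}} = (S_i s_i)^{c_i}$, for some integer $c_i \ge 0$, where $w{\downarrow}_A$ is the projection of w to alphabet A. Then
a word $w \in A_V^*$ is balanced on V iff, for all $s_i \in V$, $w{\downarrow}_{\{S_i, s_i\}} = (S_i s_i)^{c_i}$, for some
integer $c_i \ge 0$. Language $\mathcal{L}(\mathcal{D}_V)$ is a regular subset of the Dyck language [5].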
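The characterization of balanced words can be checked directly by projecting a word onto each pair of letters. The sketch below is an illustration only: the encoding of letters as pairs `(kind, i)`, with kind `"S"` for a change of excitation $S_i$ and kind `"s"` for a change of variable $s_i$, is an assumption made here, as is accepting the empty projection ($c_i = 0$).

```python
import re

def is_balanced(word, variables):
    """True iff, for every i in variables, the projection of word onto {S_i, s_i}
    has the form (S_i s_i)^c for some c >= 0."""
    for i in variables:
        projection = "".join(kind for kind, j in word if j == i)
        if re.fullmatch(r"(Ss)*", projection) is None:
            return False
    return True

# An interleaved word over V = {s_3, s_4}: S_3 S_4 s_4 s_3.
balanced = [("S", 3), ("S", 4), ("s", 4), ("s", 3)]
# Here s_3 changes before its excitation does, so the word is not balanced.
unbalanced = [("s", 3), ("S", 3)]
print(is_balanced(balanced, [3, 4]), is_balanced(unbalanced, [3, 4]))
```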

We now establish the relation between hazard-preserving paths and delay automata.
Let V be defined as before. We limit our interest to paths that are hazard-preserving on V.

To any path $\pi$ in $G_a^1(b)$ we associate $w_V^\pi \in A_V^*$, called the path-word on V, as follows:
we label each step of the path with $S_i$ if $S_i$ changes, and with $s_i$ if $s_i$ changes in that
step, for $S_i \in S(V)$, $s_i \in V$. Other steps are labelled by $\varepsilon$. Since the variables of V are
unrelated and have distinct excitations, each step has a single label. Path-word $w_V^\pi$ is the
concatenation of the labels along path $\pi$.

Example 8. Consider $G_a(b)$ of ExampleHl The subgraph $G_a^1(b)$ is shown in Fig.El by
boldface edges. We choose $V = \{s_3, s_4\}$ and show the labels on edges. The values of
$s_3$, $s_4$ are also in boldface in each state. For $\pi$ = 10101, 00101, 01101, 01111, 01011,
$w_V^\pi = S_3 S_4 s_4 s_3$.

We denote by $\mathcal{H}_V$ the set of all paths in $G_a^1(b)$ that are hazard-preserving on V.
Let $W_V = \{w_V^\pi \mid \pi \in \mathcal{H}_V\}$. The delay automaton is quite general, and applies to
any network N and any $G_a^1(b)$, as long as we find a set V that satisfies the necessary
requirements. For a particular network N and $G_a^1(b)$, not all words accepted by the
automaton correspond to paths in $G_a^1(b)$. We find a necessary and sufficient condition
for a balanced word to be a path-word.

Definition 9. A word w G C{T>v) is relevant to G'g^{b) iff there exists a path tt in G'ffb) 
such that wls(^Y')= We denote by C(T>y)Iqi the set of all words in L{T>y) 

that are relevant to G'ffb). 



Example 9. For the circuit and $G_a^1(b)$ of Example 8, with $V = \{s_3, s_4\}$, the delay
automaton $\mathcal{D}_V$ is in Fig. 5 (right), with $S_3, s_3, S_4, s_4$ taking the roles of $S_1, s_1, S_2, s_2$,
respectively. For example, $S_4 S_3 s_4 s_3$ and $S_3 s_3 S_4 s_4$ are relevant, while $S_3 s_3 S_3 S_4 s_3 s_4$ and
$S_3 S_4 s_4 S_4 s_3 s_4$ are irrelevant.

We state without proof (see B for details) the following proposition that reduces
finding a hazard-preserving path to finding a relevant balanced word.

Proposition 2. $W_V = \mathcal{L}(\mathcal{D}_V)|_{G_a^1(b)}$.

5.3 Worst-Case Paths 

We now characterize worst-case paths using gate automata. Let $s_i$ be any gate variable
in S, with $\phi(s_i) = \{s_1, \ldots, s_k\}$, and let $(\mathcal{G}_i, p_i)$ be its gate automaton, with $p_i =
(b_1, \ldots, b_k)$ (the values of $s_1, \ldots, s_k$ in b). For any path $\pi$ in $G_a^1(b)$ we label with $s_j$
each step in which $s_j$ changes, for all $j \in [k]$. The word obtained by concatenating
the labels along $\pi$ is an input word $u^\pi \in I_i^*$ that shows how the fan-in variables of
$s_i$ change along $\pi$. The output word $v^\pi$ produced by $u^\pi$ shows how the excitation $S_i$
changes on $\pi$. Let $t_1, \ldots, t_k$ be the transients determined by $u^\pi$. Then the following
hold: 1) $\sigma_j^\pi = t_j$, for all $j \in [k]$, and 2) $\Sigma_i^\pi = v^\pi$. Let $\pi$ be a path in $G_a^1(b)$ labelled
with $u^\pi$ as above.

Proposition 3. Path $\pi$ is worst-case on $s_i$ iff $u^\pi$ is a worst-case word for $(\mathcal{G}_i, p_i)$.

5.4 Delay Automata and Gate Automata 

Having reduced finding a hazard-preserving path to finding a relevant balanced word,
and finding a worst-case path to finding a worst-case word, we now state an important
lemma that relates these two kinds of words, and guarantees the existence of a path that
is both hazard-preserving and worst-case. For any gate variable $s_i$ of a feedback-free
circuit, we have a delay automaton $\mathcal{D}_{I_i}$ for its fan-in set, and a gate automaton $(\mathcal{G}_i, p_i)$.

For any alphabet A, a word $r \in A^*$ is called prefix-restricted if r has a prefix $r'$ having
exactly one occurrence of each letter of r. We call $r'$ the key prefix of r. For example,
word abaabb is prefix-restricted, with key prefix ab, but aabab is not prefix-restricted.
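The definition of a prefix-restricted word translates into a short check. This is a minimal sketch; the function name `key_prefix` is chosen here for illustration. The only prefix that can contain each letter of r exactly once is the one whose length equals the number of distinct letters, so it suffices to test that prefix.

```python
def key_prefix(r):
    """Return the key prefix of r if r is prefix-restricted, otherwise None."""
    letters = set(r)
    candidate = r[:len(letters)]
    # The candidate has as many positions as there are distinct letters, so it
    # contains every letter exactly once iff it contains every letter at all.
    return candidate if set(candidate) == letters else None

print(key_prefix("abaabb"))  # the word from the text, key prefix "ab"
print(key_prefix("aabab"))   # not prefix-restricted
```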

Recall that $I_i = \phi(s_i)$, $S(I_i)$ is the set of the excitations of the variables in $I_i$, and
$A_{I_i} = I_i \cup S(I_i)$ is the alphabet of $\mathcal{D}_{I_i}$. The lemma below relates words over $S(I_i)$ to
words over $I_i$ in the following sense. Given any prefix-restricted word over $S(I_i)$, we
can always find a worst-case word over $I_i$, such that an interleaving of the two words is
a balanced word. The result is limited to 1- and 2-input gates; we conjecture the result
to be true in the general case.

Lemma 2. Let $(\mathcal{G}_i, p_i)$ be the gate automaton of variable $s_i$, for a 1- or 2-input gate.
For any prefix-restricted word $r \in S(I_i)^*$ having key prefix $r'$, there exists a balanced
word $w \in A_{I_i}^*$ such that $w{\downarrow}_{S(I_i)} = r$, and $w{\downarrow}_{I_i}$ is a worst-case word for $(\mathcal{G}_i, p_i)$.
Also, w has a prefix $w'$ such that $w'{\downarrow}_{S(I_i)} = r'$, and the output profile of $w'{\downarrow}_{I_i}$ has length
2 if the output profile of $w{\downarrow}_{I_i}$ has length > 1.

Proposition 2 and Lemma 2 help us construct a path $\pi$ satisfying the conditions of
Lemma 1, and hence conclude the proof of Theorem 2. For details see [HI.

6 Conclusions 

Our results can be summarized as follows. Assume that we have a feedback-free gate 
network N, in which state variables are associated with gates only. We perform the 
binary analysis of N started in a stable state. We extend the Boolean functions in N to 
functions on transients. This gives us the transient version N of N. We now simulate 
N using algorithm A; for feedback-free circuits this algorithm always terminates. Next, 
we add wire delays to N, obtaining the complete binary network N and its transient 
counterpart N. By Theorem 1, the simulation of N agrees with the simulation of N on 
the variables of N. Finally, by Theorem 2, we know that binary analysis of N covers the 
simulation of N, and hence that of N. In conclusion, we have shown that simulation of 
feedback-free circuits is not pessimistic, if wire delays are taken into account.

Abstract. A deterministic finite automaton (DFA) A is called a cover
automaton (DFCA) for a finite language L over some alphabet $\Sigma$ if
$L = L(A) \cap \Sigma^{\le l}$, with $l$ being the length of some longest word in L.
Thus a word $w \in \Sigma^*$ is in L if and only if $|w| \le l$ and $w \in L(A)$. The
DFCA A is minimal if no DFCA for L has fewer states.

In this paper, we present an algorithm which converts an n-state DFA for
some finite language L into a corresponding minimal DFCA, using only
$O(n \log n)$ time and $O(n)$ space. The best previously known algorithm
121 requires $O(n^2)$ time and space. Furthermore, the new algorithm can
also be used to minimize any DFCA, where the best previous method [T]
takes $O(n^4)$ time and space.



1 Introduction 

Regular languages and finite automata (DFA) are widely used in theoretical
computer science and have been studied extensively. However, many applications
only deal with finite languages. In this paper we analyze cover automata (DFCA)
which are capable of parsing these languages as follows. Let L be some finite
language and $l$ the length of its longest word(s). A DFCA for L accepts a word
w with $|w| \le l$ if and only if $w \in L$, but it may also accept additional words
that are longer than $l$. Thus deciding the membership problem with a DFCA only
requires an additional comparison of two integers.

Recently, DFCA have been further elaborated FEE!. Compared with ordinary
DFA, corresponding DFCA are much smaller in many cases. As an example,
let L = {a, aba, ababa}. Figure 1 (a) shows a minimal finite automaton accepting
L with seven states. The cover automaton in Fig. 1 (b) also accepts some longer
words, but only has three states. Hence DFCA are more space efficient.
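The membership test with a cover automaton can be sketched in a few lines. The three-state transition table below is an assumption: the figure itself is not reproduced in this text, so the table is one cover automaton for L = {a, aba, ababa} that is consistent with the description (state names `A`, `B`, `C` are hypothetical).

```python
from itertools import product

# Assumed three-state cover automaton: it accepts a(ba)*, state "C" is a sink.
COVER_DELTA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "C", "b": "A"},
    "C": {"a": "C", "b": "C"},
}
COVER_START, COVER_FINAL, MAX_LEN = "A", {"B"}, 5

def accepts(word):
    """Ordinary DFA acceptance, ignoring the length bound."""
    state = COVER_START
    for symbol in word:
        state = COVER_DELTA[state][symbol]
    return state in COVER_FINAL

def in_language(word):
    """Membership in L: one integer comparison plus a run of the cover automaton."""
    return len(word) <= MAX_LEN and accepts(word)

words = ("".join(w) for n in range(MAX_LEN + 1) for w in product("ab", repeat=n))
print(sorted(w for w in words if in_language(w)))
```

The word abababa is accepted by the automaton but rejected by `in_language`, because its length exceeds 5.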

Paun et al. |3] have suggested an algorithm for converting an n-state DFA
into a corresponding DFCA with the fewest possible number of states. It requires
$O(n^2)$ time and space. Another algorithm [T] consuming $O(n^4)$ time and space
has been proposed for directly minimizing a DFCA. However, in this paper we
shall show that both problems can be solved in $O(n \log n)$ time and $O(n)$ space
by decomposing the set of states of the original DFA or DFCA accordingly.

The paper is organized as follows. The next section reviews some notation
and basic results on DFCA. Section 3 introduces similarity state decompositions
(SSD). Given such a decomposition, a minimal DFCA can be easily established.
An SSD can be computed with the algorithm presented in Sect. 4. Finally, Sect. 5
contains the complexity analysis and some concluding remarks.

Fig. 1. A minimal DFA (a) and a minimal DFCA (b) for L = {a, aba, ababa}

2 Preliminaries 

We assume the reader to be familiar with the theory of finite automata as presented
in standard books, e.g. [B]. A DFA is a quintuple $A = (Q, \Sigma, \delta, q_0, F)$,
where $Q = \{q_0, \ldots, q_{n-1}\}$ is a finite set of states, $\Sigma$ is a finite nonempty set of
input symbols, $\delta : Q \times \Sigma \to Q$ is a (total) transition function, $q_0 \in Q$ is the
initial state, and $F \subseteq Q$ is the set of final states. We extend $\delta$ to $Q \times \Sigma^*$ by
setting $\delta(q, \varepsilon) = q$ and $\delta(q, aw) = \delta(\delta(q, a), w)$, where $a \in \Sigma$ and $w \in \Sigma^*$. Let
$|w|$ be the length of a word $w \in \Sigma^*$. For $k \ge 0$ let $\Sigma^{\le k} = \{w \in \Sigma^* \mid |w| \le k\}$.
The language recognised by A is $L(A) = \{w \in \Sigma^* \mid \delta(q_0, w) \in F\}$.

For the rest of the paper, we assume L to be some fixed finite language. By
$l$ we denote the maximal length of a word in L.

Definition 1. A DFA A is a deterministic finite cover automaton (DFCA) for
L iff $L(A) \cap \Sigma^{\le l} = L$. (We then also say that A covers L.) A is called minimal
if no DFCA for L has fewer states than A.

All states of a DFCA $A = (Q, \Sigma, \delta, q_0, F)$ for L are assumed to be useful, i.e., for
$q \in Q$ there exists some word $w \in \Sigma^*$ such that $\delta(q_0, w) = q$ (useless states can
be easily removed in time linear in the number of states n). This allows us to
define $level_A(q) := \min\{|w| \mid \delta(q_0, w) = q\}$ for all $q \in Q$. We simply write $level(q)$
when the corresponding DFA is clear from the context. To give an example, in
Fig. 1 (a) we have $level(q_0) = 0$, $level(q_1) = 1$, $level(q_2) = 2$, $level(q_3) = 3$,
$level(q_4) = 4$, $level(q_5) = 5$, and $level(q_6) = 1$. By applying a breadth-first-
search algorithm [1] to the state transition diagram associated with A, the levels
of all states can also be computed in linear time. Clearly, a state $q \in Q$ with
$level_A(q) > l$ can be removed from A without changing $L(A) \cap \Sigma^{\le l}$. Thus for
the rest of this paper we assume $level(q) \le l$ for all states q of any automata.
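The breadth-first computation of levels can be sketched as follows. The transition table is an assumption reconstructed from the description of Fig. 1 (a): a chain $q_0, \ldots, q_5$ reading ababa, with every other transition leading to the sink $q_6$. States are written as the integers 0 to 6.

```python
from collections import deque

# Assumed seven-state DFA for L = {a, aba, ababa}; state 6 is the sink.
DELTA = {
    0: {"a": 1, "b": 6},
    1: {"a": 6, "b": 2},
    2: {"a": 3, "b": 6},
    3: {"a": 6, "b": 4},
    4: {"a": 5, "b": 6},
    5: {"a": 6, "b": 6},
    6: {"a": 6, "b": 6},
}

def compute_levels(delta, start):
    """level(q) = length of a shortest word leading from the start state to q."""
    level = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for target in delta[p].values():
            if target not in level:      # first visit gives the shortest distance
                level[target] = level[p] + 1
                queue.append(target)
    return level

print(compute_levels(DELTA, 0))
```

Each state is enqueued once and each transition is inspected once, which gives the linear running time mentioned in the text.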
Definition 2. Let $A = (Q, \Sigma, \delta, q_0, F)$ be a DFA and $p, q \in Q$. We say p and q
are similar (denoted $p \sim q$) iff $\delta(p, w) \in F \Leftrightarrow \delta(q, w) \in F$ for all $w \in \Sigma^{\le l - k}$,
where $k = \max\{level(p), level(q)\}$. Otherwise, p and q are dissimilar ($p \not\sim q$).

For example, concerning the DFA in Fig. 1 (a), it is rather easy to see that $q_2$
and $q_4$ are similar. Firstly note that $l = 5$ and $\max\{level(q_2), level(q_4)\} = 4$. Now
clearly $\delta(q_2, w) \in F \Leftrightarrow \delta(q_4, w) \in F$ for all $w \in \Sigma^{\le 1} = \{\varepsilon, a, b\}$. However, $q_4$
and $q_5$ are dissimilar because $\delta(q_4, \varepsilon) = q_4 \notin F$ and $\delta(q_5, \varepsilon) \in F$.
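Definition 2 can be checked by brute force on small automata. The sketch below enumerates all words up to the length bound, so it is exponential and meant only as an illustration of the definition, not as the efficient method developed in the paper. The transition table and the levels are assumptions reconstructed from the description of Fig. 1 (a).

```python
from itertools import product

# Assumed seven-state DFA for L = {a, aba, ababa}; state 6 is the sink.
DELTA = {
    0: {"a": 1, "b": 6}, 1: {"a": 6, "b": 2}, 2: {"a": 3, "b": 6},
    3: {"a": 6, "b": 4}, 4: {"a": 5, "b": 6}, 5: {"a": 6, "b": 6},
    6: {"a": 6, "b": 6},
}
FINAL = {1, 3, 5}
LEVEL = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 1}
MAX_LEN = 5  # l, the length of the longest word in L

def run(state, word):
    for symbol in word:
        state = DELTA[state][symbol]
    return state

def similar(p, q, alphabet="ab"):
    """p ~ q iff no word of length <= l - max(level(p), level(q)) separates them."""
    k = max(LEVEL[p], LEVEL[q])
    for n in range(MAX_LEN - k + 1):
        for word in product(alphabet, repeat=n):
            if (run(p, word) in FINAL) != (run(q, word) in FINAL):
                return False
    return True

print(similar(2, 4), similar(4, 5))
```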

Note that $\sim$ is reflexive and symmetric, but not transitive (see PQ for further
properties). A DFCA A for L is known to be minimal if and only if all states of A
are pairwise dissimilar. However, checking similarity for all pairs of states in a
straightforward way consumes at least $\Omega(n^2)$ time. We therefore use a different
method to obtain a minimal DFCA, which is presented in the next section.

3 Similarity Decompositions 

This section gives an introduction to similarity state decompositions. We shall 
prove that once such a decomposition is given, it is easy to construct a minimal 
DFCA for L. We start with the basic definition. 

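Definition 3. Let $A = (Q, \Sigma, \delta, q_0, F)$ be a DFA. We say that $(Q_1, \ldots, Q_r)$ is
a similarity state decomposition (SSD) of Q if

a) $Q_1 \cup \ldots \cup Q_r = Q$

b) $Q_i \cap Q_j = \emptyset$ for all $1 \le i < j \le r$

c) $\forall\, 1 \le i \le r \;\; \forall\, p, q \in Q_i : p \sim q$

d) $\forall\, 1 \le i < j \le r \;\; \exists\, p \in Q_i \;\; \exists\, q \in Q_j : p \not\sim q$

For all $q \in Q$ we denote by $[q]$ the set $Q_i$ of the decomposition such that $q \in Q_i$.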

Concerning our example in Fig. 1 (a), the reader can easily verify the following
properties: $q_0 \sim q_2$, $q_0 \sim q_4$, $q_2 \sim q_4$, $q_1 \sim q_3$, $q_1 \sim q_5$, $q_3 \sim q_5$, $q_0 \not\sim q_1$, $q_0 \not\sim q_6$,
and $q_1 \not\sim q_6$. Hence $Q_1 := \{q_0, q_2, q_4\}$, $Q_2 := \{q_1, q_3, q_5\}$, and $Q_3 := \{q_6\}$ form
an SSD.

The goal for this section is to show the following theorem. 

Theorem 1. Let $(Q_1, \ldots, Q_r)$ be an SSD for some DFCA $A = (Q, \Sigma, \delta, q_0, F)$
covering L. Choose $p_i \in Q_i$ such that $level(p_i) = \min\{level(q) \mid q \in Q_i\}$, for all
$1 \le i \le r$. Define a DFA $B = (Q', \Sigma, \delta', q_0', F')$ by setting $Q' := \{Q_1, \ldots, Q_r\}$,
$q_0' := [q_0]$, $F' := \{[q] \mid q \in F\}$, and

$\delta'(Q_i, a) := [\delta(p_i, a)]$, for all $1 \le i \le r$ and $a \in \Sigma$.

Then B is a minimal DFCA for L.

In Fig. 1 (a), we have $p_1 := q_0$, $p_2 := q_1$, and $p_3 := q_6$. Now, for example, since
$\delta(p_2, b) = \delta(q_1, b) = q_2 \in Q_1$, we obtain $\delta'(Q_2, b) = Q_1$. By checking the other
transitions we finally end up with the DFCA shown in Fig. 1 (b).
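The construction of B in Theorem 1 can be sketched directly from an SSD. The transition table, levels and SSD below are assumptions reconstructed from the running example (states written as integers 0 to 6, blocks indexed from 0); the function name `quotient` is hypothetical.

```python
from itertools import product

# Assumed seven-state DFA for L = {a, aba, ababa}; state 6 is the sink.
DELTA = {
    0: {"a": 1, "b": 6}, 1: {"a": 6, "b": 2}, 2: {"a": 3, "b": 6},
    3: {"a": 6, "b": 4}, 4: {"a": 5, "b": 6}, 5: {"a": 6, "b": 6},
    6: {"a": 6, "b": 6},
}
FINAL = {1, 3, 5}
LEVEL = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 1}
SSD = [{0, 2, 4}, {1, 3, 5}, {6}]

def quotient(blocks, delta, start, final, level, alphabet="ab"):
    """Build B: one state per block, transitions taken from a minimal-level representative."""
    block_of = {q: i for i, block in enumerate(blocks) for q in block}
    reps = [min(block, key=lambda q: level[q]) for block in blocks]   # the p_i
    new_delta = {
        i: {c: block_of[delta[reps[i]][c]] for c in alphabet}
        for i in range(len(blocks))
    }
    return new_delta, block_of[start], {block_of[q] for q in final}

B_DELTA, B_START, B_FINAL = quotient(SSD, DELTA, 0, FINAL, LEVEL)

def b_accepts(word):
    state = B_START
    for symbol in word:
        state = B_DELTA[state][symbol]
    return state in B_FINAL

print(B_DELTA, B_START, B_FINAL)
```

Restricted to words of length at most 5, the resulting three-state automaton accepts exactly a, aba and ababa.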

We require three auxiliary results to prove Thm. 1.

Lemma 1. Let A and B be as in Thm. 1. Then $q \in F \Leftrightarrow [q] \in F'$ for all $q \in Q$.

Proof. By definition, $[q] \in F'$ for all $q \in F$. Now let $[q] \in F'$, i.e., $q \in [p]$ for
some $p \in F$. Since we have $p \sim q$, $level_A(p) \le l$ and $level_A(q) \le l$, we conclude
that $p \in F \Leftrightarrow \delta(p, \varepsilon) \in F \Leftrightarrow \delta(q, \varepsilon) \in F \Leftrightarrow q \in F$. □

Lemma 2. Let A and B be as given in Thm. 1. Let $w = a_1 \ldots a_k \in \Sigma^{\le l}$. Then
for $0 \le j \le k$ there exists some $p \in \delta'(q_0', a_1 \ldots a_j)$ such that $level_A(p) \le j$ and
$\delta(p, a_{j+1} \ldots a_k) \in F \Leftrightarrow w \in L(A)$.

Proof. If $j = 0$, then we can choose $p := q_0$. To prove the claim for $j + 1$, let
i be such that $Q_i = \delta'(q_0', a_1 \ldots a_j)$. By induction there exists some $p' \in Q_i$ with
$level_A(p') \le j$ and $\delta(p', a_{j+1} \ldots a_k) \in F \Leftrightarrow w \in L(A)$. The properties of $p_i$ imply
$level_A(p_i) \le level_A(p') \le j$. Since $p_i \sim p'$ and $|a_{j+1} \ldots a_k| = k - j \le l - j$, we
have $\delta(p_i, a_{j+1} \ldots a_k) \in F \Leftrightarrow \delta(p', a_{j+1} \ldots a_k) \in F \Leftrightarrow w \in L(A)$. Now choose
$p := \delta(p_i, a_{j+1})$. Then $level_A(p) \le level_A(p_i) + 1 \le level_A(p') + 1 \le j + 1$ and
$p \in [\delta(p_i, a_{j+1})] = \delta'(Q_i, a_{j+1}) = \delta'(q_0', a_1 \ldots a_{j+1})$. Also, since $\delta(p, a_{j+2} \ldots a_k)$
equals $\delta(p_i, a_{j+1} \ldots a_k)$, it follows that $\delta(p, a_{j+2} \ldots a_k) \in F \Leftrightarrow w \in L(A)$. □



Lemma 3. Let A and B be as given in Thm. 1. Then $level_B([q]) \le level_A(q)$ for
all $q \in Q$.

Proof. Assuming the contrary, let $k := level_A(q)$ be the smallest number such
that $level_B([q]) \ge k + 1$ for some $q \in Q$. Clearly $q \ne q_0$ and $k \ge 1$ because
$level_B([q_0]) = level_A(q_0) = 0$. Let $p \in Q$ and $a \in \Sigma$ such that $level_A(p) = k - 1$
and $\delta(p, a) = q$. By our choice of k, $level_B([p]) \le k - 1$. Let $[p] = Q_i$ and
$\delta'([p], a) = Q_j$ for suitable i and j. Then $\delta(p_i, a) \in Q_j$. Moreover, $level_B([q]) > k$
and $level_B(Q_j) \le level_B([p]) + 1 \le k$ imply that $Q_j \ne [q]$. So there exist $p' \in Q_j$
and $q' \in [q]$ such that $p' \not\sim q'$. Since $[q'] = [q]$, $level_A(q') < k$ would contradict
our choice of k and q. Hence $level_A(q') \ge k$. Thus there exists some $w \in \Sigma^*$
such that $\delta(p', w) \in F \Leftrightarrow \delta(q', w) \notin F$ and $|w| \le l - \max\{level_A(p'), level_A(q')\}$.
Since $level_A(\delta(p_i, a)) \le level_A(p_i) + 1 \le level_A(p) + 1 = k$ and $\delta(p_i, a) \sim p'$ (both
states are in $Q_j$), we conclude that $\delta(\delta(p_i, a), w) \in F \Leftrightarrow \delta(p', w) \in F$. Similarly,
$level_A(q) = k$ and $q \sim q'$ imply that $\delta(\delta(p, a), w) = \delta(q, w) \in F \Leftrightarrow \delta(q', w) \in F$.
So altogether it follows that $\delta(p_i, aw) \in F \Leftrightarrow \delta(p, aw) \notin F$. However, since
$|aw| \le l - k + 1$ and $level_A(p_i) \le level_A(p) = k - 1$, this contradicts $p_i \sim p$. □

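We are now prepared for proving the previously stated theorem.

Proof (of Thm. 1). Let $w = a_1 \ldots a_k \in \Sigma^{\le l}$. By using Lemma 2 with $j = k$,
there exists some $p \in \delta'(q_0', w)$ such that $w \in L(A) \Leftrightarrow \delta(p, \varepsilon) \in F \Leftrightarrow p \in F$ and
$level_A(p) \le l$. Also, by Lemma 1, $p \in F \Leftrightarrow [p] = \delta'(q_0', w) \in F' \Leftrightarrow w \in L(B)$. So
for all $w \in \Sigma^{\le l}$ we have $w \in L \Leftrightarrow w \in L(A) \Leftrightarrow w \in L(B)$. Hence B covers L.

To show that B is minimal, assume a DFCA $C = (\hat{Q}, \Sigma, \hat{\delta}, \hat{q}_0, \hat{F})$ covering L
with $|\hat{Q}| < |Q'| = r$. For all $1 \le i \le r$ let $v_i \in \Sigma^*$ such that $|v_i| = level_B(Q_i)$ and
$\delta'(q_0', v_i) = Q_i$. Since $|\hat{Q}| < r$, there exist $1 \le i < j \le r$ with $\hat{\delta}(\hat{q}_0, v_i) = \hat{\delta}(\hat{q}_0, v_j)$.
By the properties of an SSD, we can choose $p \in Q_i$ and $q \in Q_j$ with $p \not\sim q$, so it
follows there exists some word z with $|z| \le l - level_A(p)$, $|z| \le l - level_A(q)$, and
$\delta(p, z) \in F \Leftrightarrow \delta(q, z) \notin F$. By Lemma 3 we have $|v_i z| \le l$ and $|v_j z| \le l$. Now
put $j = |v_i|$ and apply Lemma 2 to $w = v_i z$. This leads to some $p' \in Q_i$ with
$\delta(p', z) \in F \Leftrightarrow w \in L(A)$ and $level_A(p') \le |v_i| \le l - |z|$, i.e., $|z| \le l - level_A(p')$.
Since $p \sim p'$, we conclude $\delta(p, z) \in F \Leftrightarrow \delta(p', z) \in F \Leftrightarrow v_i z \in L(A)$. Similarly,
$\delta(q, z) \in F \Leftrightarrow v_j z \in L(A)$, so $v_i z \in L \Leftrightarrow v_i z \in L(A) \Leftrightarrow v_j z \notin L(A) \Leftrightarrow v_j z \notin L$.
But $v_i z \in L(C) \Leftrightarrow \hat{\delta}(\hat{\delta}(\hat{q}_0, v_i), z) \in \hat{F} \Leftrightarrow \hat{\delta}(\hat{\delta}(\hat{q}_0, v_j), z) \in \hat{F} \Leftrightarrow v_j z \in L(C)$, so C
cannot cover L. □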

The crucial question now is how to compute an SSD for a given DFCA. An 
efficient method for this task is presented next. 

4 The Decomposition Algorithm 

The algorithm for decomposing the state set of some DFCA $(Q, \Sigma, \delta, q_0, F)$ into
an SSD is a (nontrivial) adaptation of Hopcroft's well known method
for minimizing ordinary DFAs [5]. We first sketch the idea, and then prove its
correctness. The complexity analysis is given in Sect. 5.

The algorithm manages a decomposition $(Q_1, \ldots, Q_r)$ of Q such that the
properties a), b) and d) of an SSD are always satisfied. The initial decomposition
is $Q_1 := Q \setminus F$ and $Q_2 := F$. As long as there is some state set $Q_i$ which violates
condition c) of an SSD (i.e., $Q_i$ contains two dissimilar states p and q), $Q_i$ is split
into two nonempty parts, where one part (containing p) replaces the current set
$Q_i$, and the other part (containing q) becomes the new state set $Q_{r+1}$. We then
also say that all states in the first part are separated from the states in the second
part. Now r is increased by one, and the described procedure is repeated until
an SSD is found. Clearly, the corresponding main loop executes at most n
times, where $n = |Q|$. Furthermore, once two states p and q have been separated,
they remain in different state sets until the algorithm terminates.

Some of the state sets determined during the execution of the algorithm are
additionally stored in a first-in-first-out (FIFO) queue T. With each execution
of the main loop, the first element of T is extracted from the queue, and possibly
causes other state sets of the current decomposition to be split. The smaller
part of each such separated state set is then appended to T. At the beginning, T
contains only one set, namely F. The main loop eventually terminates as soon as
T becomes empty. Hence, if the resulting SSD contains r state sets, then exactly
r - 1 elements have been appended to (and removed from) T.

From the algorithm shown in Fig. 2 it can be seen that T in fact stores pairs
of the form (S, k), where S is some state set as described above, and k is some
integer which corresponds to the levels of the states in S. The exact meaning of
k will become clear during the correctness proof presented below. Later, we shall
also analyze the state sets X and Y which are computed in lines 8-9 (roughly
speaking, X and Y contain the dissimilar p's and q's mentioned above).

 1  If F = ∅ Or F = Q Then { Output SSD = (Q); Exit; }
 2  Compute level(q) for all q ∈ Q;
 3  Q_1 := Q \ F;  Q_2 := F;  r := 2;
 4  Initialize FIFO queue T with the only element (F, 0);
 5  While T ≠ ∅ Do {   (* main loop *)
 6    Extract first element (S, k) from T;
 7    For c ∈ Σ Do {
 8      X := {p | δ(p, c) ∈ S ∧ level(p) + k < l};
 9      Y := {q | δ(q, c) ∉ S ∧ level(q) + k < l};
10      For i := r Downto 1 Do {
11        If Q_i ∩ X ≠ ∅ And Q_i ∩ Y ≠ ∅ Then {
12          Choose Z ⊆ Q_i such that Q_i ∩ X ⊆ Z and Q_i ∩ Y ⊆ Q_i \ Z;
13          If |Z| < |Q_i \ Z| Then {
14            Q_{r+1} := Z;
15            Q_i := Q_i \ Z;
16          } Else {
17            Q_{r+1} := Q_i \ Z;
18            Q_i := Z;
19          }
20          Append (Q_{r+1}, k + 1) to FIFO queue T;
21          r := r + 1;
22        }
23      }
24    }
25  }
26  Output SSD = (Q_1, ..., Q_r);

Fig. 2. The algorithm which determines an SSD for some DFCA $(Q, \Sigma, \delta, q_0, F)$

Before proving the correctness of the algorithm, let us apply it to the example
in Fig. 1 (a). Recall that the length $l$ of the longest word in L is five. Starting
with $Q_1 := \{q_0, q_2, q_4, q_6\}$ and $Q_2 := \{q_1, q_3, q_5\}$ (line 3), the algorithm reaches
the For-loop in line 7 with $S = \{q_1, q_3, q_5\}$, $k = 0$ and $r = 2$ (lines 3, 4, and 6).
Now for $c = a$ we get $X = \{p \mid \delta(p, a) \in S \wedge level(p) + 0 < 5\} = \{q_0, q_2, q_4\}$ and
analogously $Y = \{q_1, q_3, q_6\}$ (lines 8 and 9). The inner For-loop (line 10) verifies
whether $Q_i$ has to be split ($i = 2, 1$). The test in line 11 fails for $Q_2$ because
$Q_2 \cap X = \emptyset$, but for $Q_1$ we have $Q_1 \cap X = \{q_0, q_2, q_4\}$ and $Q_1 \cap Y = \{q_6\}$.
Thus in line 12 the algorithm chooses the only possibility $Z = \{q_0, q_2, q_4\}$, and
in lines 17-18 it sets $Q_3$ to $\{q_6\}$, whereas $Q_1$ is overwritten with $\{q_0, q_2, q_4\}$.
Finally $(\{q_6\}, 1)$ is appended to the currently empty queue T (line 20), and r is
increased to three to reflect the new number of sets in the actual decomposition
(line 21). Observe that the other instance of the inner For-loop with $c = b$ yields
$X = \{p \mid \delta(p, b) \in S \wedge level(p) + 0 < 5\} = \emptyset$ in line 8, so the test in line 11 always
fails. Furthermore, during the next iteration of the main loop, T becomes empty
again after extracting $(\{q_6\}, 1)$ in line 6, and this time it is easy to see that no
more state sets are split. Hence T remains empty, and the algorithm terminates
with the desired SSD $(\{q_0, q_2, q_4\}, \{q_1, q_3, q_5\}, \{q_6\})$.
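The run just described can be reproduced with a direct transcription of the algorithm of Fig. 2. This is a minimal sketch under stated assumptions: the transition table is reconstructed from the description of Fig. 1 (a), line 12 is resolved by always taking $Z = Q_i \cap X$ (one admissible choice), and sets are recomputed naively, so the sketch does not achieve the $O(n \log n)$ bound discussed in the paper.

```python
from collections import deque

# Assumed seven-state DFA for L = {a, aba, ababa}; state 6 is the sink.
DELTA = {
    0: {"a": 1, "b": 6}, 1: {"a": 6, "b": 2}, 2: {"a": 3, "b": 6},
    3: {"a": 6, "b": 4}, 4: {"a": 5, "b": 6}, 5: {"a": 6, "b": 6},
    6: {"a": 6, "b": 6},
}

def bfs_levels(delta, start):
    level, queue = {start: 0}, deque([start])
    while queue:
        p = queue.popleft()
        for target in delta[p].values():
            if target not in level:
                level[target] = level[p] + 1
                queue.append(target)
    return level

def similarity_decomposition(delta, start, final, alphabet, l):
    states, final = set(delta), set(final)
    if not final or final == states:                      # line 1
        return [states]
    level = bfs_levels(delta, start)                      # line 2
    blocks = [states - final, set(final)]                 # line 3
    queue = deque([(frozenset(final), 0)])                # line 4
    while queue:                                          # line 5
        S, k = queue.popleft()                            # line 6
        for c in alphabet:                                # line 7
            X = {p for p in states if delta[p][c] in S and level[p] + k < l}
            Y = {q for q in states if delta[q][c] not in S and level[q] + k < l}
            for i in range(len(blocks) - 1, -1, -1):      # line 10
                block = blocks[i]
                if block & X and block & Y:               # line 11
                    Z = block & X                         # line 12, one admissible choice
                    rest = block - Z
                    if len(Z) < len(rest):                # lines 13-19: new set is the smaller part
                        new_block, blocks[i] = Z, rest
                    else:
                        new_block, blocks[i] = rest, Z
                    blocks.append(new_block)
                    queue.append((frozenset(new_block), k + 1))   # line 20
    return blocks

print(similarity_decomposition(DELTA, 0, {1, 3, 5}, "ab", 5))
```

The queue stores a frozen snapshot of each appended set, so later splits of that set do not change the pair that is eventually extracted.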

To show the correctness of the algorithm, we first prove some helpful properties.

Lemma 4. Let $(S_2, k_2), (S_3, k_3), \ldots, (S_r, k_r)$ be the complete sequence of
elements appended to T during the execution of the algorithm. Then $k_2 \le \ldots \le k_r$.

Proof. Let i be the smallest index satisfying $k_i > k_{i+1} > 0$. Then $(S_i, k_i)$ has
been appended to T due to some pair $(S_u, k_u)$ with $u < i$ and $k_u + 1 = k_i$. Later,
$(S_{i+1}, k_{i+1})$ has been appended to T due to some pair $(S_v, k_v)$ with $u \le v \le i$
and $k_v + 1 = k_{i+1}$. But then $k_u + 1 = k_i > k_{i+1} = k_v + 1$ implies $k_u \ne k_v$, i.e.,
$u \ne v$. Hence $u < v \le i$, i.e., $u \le v - 1 < i$, and by our choice of i we have
$k_{v-1} \ge k_u = k_i - 1 > k_{i+1} - 1 = k_v$, which in turn contradicts the choice of i. □

Lemma 5. For two dissimilar states $p, q \in Q$ let $0 \le m \le l$ and $w \in \Sigma^{\le m}$ such
that $level(p) \le l - m$, $level(q) \le l - m$, and $\delta(p, w) \in F \Leftrightarrow \delta(q, w) \notin F$. Then p
and q are separated and a pair (S, k) with $p \in S \Leftrightarrow q \notin S$ and $k \le m$ is appended
to T.

Proof. If $m = 0$, then $w = \varepsilon$ and therefore $p \in F \Leftrightarrow q \notin F$. Moreover, all states
have a level of at most $l$. Thus, from the program lines 1-4, it is easy to see
that the claim holds. Now assume $level(p) \le l - (m + 1)$, $level(q) \le l - (m + 1)$,
and $w = bw'$ for some $b \in \Sigma$ and $w' \in \Sigma^{\le m}$. Let $p' = \delta(p, b)$ and $q' = \delta(q, b)$.
Then $level(p') \le level(p) + 1 \le l - m$ and $level(q') \le l - m$. Additionally, we
have $\delta(p', w') = \delta(p, bw') \in F \Leftrightarrow \delta(q, bw') = \delta(q', w') \notin F$. By the induction
hypothesis, a pair $(S, k')$ with $p' \in S \Leftrightarrow q' \notin S$ and $k' \le m$ is appended
to T. Now assume $p, q \in Q_j$ for some j when $(S, k')$ is extracted from T in
line 6. Consider the instance of the For-loop in line 7 when c equals b. Since
$level(p) \le l - (m + 1) < l - m \le l - k'$ and similarly $level(q) < l - k'$, it is easy
to see from lines 8-9 that either $p \in X \wedge q \in Y$ or $q \in X \wedge p \in Y$. Consider
the instance of the For-loop in line 10 when $i = j$. The condition in line 11 is
satisfied, and when line 20 is reached, p and q have been separated. Also, since
$k' + 1 \le m + 1$, line 20 appends a pair to T as claimed.

If p and q were separated before $(S, k')$ is extracted from T, then the corresponding
state set either has been split while executing lines 1-4 (in this case
the claim follows immediately), or due to some pair $(S'', k'')$ with $k'' \le k'$ by
Lemma 4. Then $k'' \le k' \le m$, i.e., $k'' + 1 \le m + 1$, and the claim follows in the
same way as presented above. □



Lemma 6. When creating a new state set $Q_{r'+1}$ ($r' \ge 2$) by splitting some
previous state set $Q_i$ ($i \le r'$) into two parts, each part contains a state p with
$level(p) \le l - k_{r'+1}$.

Proof. Before appending the pair $(Q_{r'+1}, k + 1)$ with $k + 1 = k_{r'+1}$ to T in line
20, the program code in lines 8-19 ensures that each part contains a state p with
$level(p) + k < l$, i.e., $level(p) \le l - k_{r'+1}$. □



Lemma 7. Let $1 \le i \le r$. Then the state set $Q_i$ of the finally computed SSD
contains a state p with $level(p) \le l - k_i$, where $k_1 = 0$.

Proof. The claim is trivial if the algorithm terminates in line 1. Otherwise, by
line 3, both $Q_1$ and $Q_2$ are nonempty. Hence, both sets contain at least one state
p with $level(p) \le l = l - k_1 = l - k_2$. Now each time a state set $Q_i$ is split into
two parts, both the new state set $Q_{r'+1}$ and the remaining part of $Q_i$ contain a
state p with $level(p) \le l - k_{r'+1} \le l - k_i$ by Lemmas 6 and 4. □

Lemma 8. When creating a new state set $Q_{r'}$ ($r' \ge 2$) it holds that

$\forall\, 1 \le j < r' \;\; \forall\, p \in Q_j \;\; \forall\, q \in Q_{r'} : \max\{level(p), level(q)\} \le l - k_{r'} \Rightarrow p \not\sim q$.

Proof. The claim can be easily verified if $r' = 2$. For $r' \ge 2$ let $Q_i$ be the state
set from which $Q_{r'+1}$ has been separated. If $j < i$, let $\bar{Q}_1, \ldots, \bar{Q}_i$ denote the
status of the decomposition after creating $Q_i$. Then $Q_{r'+1} \subseteq \bar{Q}_i$ and $Q_j \subseteq \bar{Q}_j$.
By Lemma 4, $l - k_{r'+1} \le l - k_i$, thus the claim follows from the induction
hypothesis for $r' = i$. Similarly, if $j > i$, let $\bar{Q}_1, \ldots, \bar{Q}_j$ be the status of the
decomposition after creating $Q_j$. Then again $Q_{r'+1} \subseteq \bar{Q}_i$ and $Q_j \subseteq \bar{Q}_j$, and the
claim follows analogously.

It remains to prove the claim for $j = i$. Consider the pair (S, k) (line 6) and
the symbol $c \in \Sigma$ (line 7) of the algorithm which caused $Q_{r'+1}$ to be created.
Then (S, k) equals $(\bar{Q}_t, k_t)$ for some $t \le r'$, where $\bar{Q}_1, \ldots, \bar{Q}_t$ was the actual
decomposition after generating $Q_t$. Now we reconstruct what happened just
before creating $Q_{r'+1}$. Firstly, line 20 implies $k_{r'+1} = k + 1 = k_t + 1$. Secondly,
from lines 8-9 we see that $X = \{p \mid \delta(p, c) \in \bar{Q}_t \wedge level(p) \le l - (k_t + 1)\}$ and
$Y = \{q \mid \delta(q, c) \notin \bar{Q}_t \wedge level(q) \le l - (k_t + 1)\}$. Furthermore, analyzing line 12
(with $Q_i$ representing the still unsplit state set) yields

$\forall\, p \in Z : (level(p) > l - (k_t + 1)) \vee p \in X$

and

$\forall\, q \in Q_i \setminus Z : (level(q) > l - (k_t + 1)) \vee q \in Y$.

Now assume $p \in Z$ and $level(p) \le l - k_{r'+1} = l - (k_t + 1)$. Then $p \in X$ and
$p' \in \bar{Q}_t$ for $p' := \delta(p, c)$. Since $level(p) \le l - (k_t + 1)$, we have $level(p') \le l - k_t$.
Similarly, $q \in Q_i \setminus Z$ and $level(q) \le l - k_{r'+1}$ implies $q' := \delta(q, c) \notin \bar{Q}_t$ (which
means $q' \in \bar{Q}_1 \cup \ldots \cup \bar{Q}_{t-1}$) and $level(q') \le l - k_t$. Therefore the induction
hypothesis for $r' = t$ yields $p' \not\sim q'$. So there is some word w such that
$\delta(p', w) \in F \Leftrightarrow \delta(q', w) \notin F$, and the word cw can be used to show
that p and q are dissimilar. □

The correctness of the algorithm follows now. 

Theorem 2. Given a DFCA $A = (Q, \Sigma, \delta, q_0, F)$ covering $L$, the algorithm in
Fig.^ correctly determines a corresponding SSD. 

Proof. Clearly the properties a) and b) of an SSD are satisfied. To prove d),
assume $1 \le u < v \le r$ and let $\bar{Q}_1, \ldots, \bar{Q}_{r'}$ be the first decomposition status
which satisfies $Q_u \subseteq \bar{Q}_{u'}$ and $Q_v \subseteq \bar{Q}_{v'}$ for some $1 \le u', v' \le r'$ with $u' \ne v'$.
Clearly $2 \le r' \le v$. Now either $u' = r'$ or $v' = r'$, and Lemma [H] implies
$\forall p \in \bar{Q}_{u'} \; \forall q \in \bar{Q}_{v'} : \max\{\mathrm{level}(p), \mathrm{level}(q)\} \le l - k_{r'} \implies p \not\sim q$. Moreover,
$\bar{Q}_{u'}$ contains a state $p$ with $\mathrm{level}(p) \le l - k_{r'}$ by Lemma [6] (or, for $r' = 2$, by
Lemma [7]). Now if $u = u'$, then Lemma [6] and [4] ensure this property also holds
for $Q_u$. Similarly, if $u \ne u'$, then $Q_u$ must have been created after $Q_{r'}$, i.e. $u > r'$,
and Lemma [7] and [4] again imply a state $p \in Q_u$ with $\mathrm{level}(p) \le l - k_u \le l - k_{r'}$.
Applying the same arguments to $\bar{Q}_{v'}$ and $Q_v$ we see that property d) of an SSD
holds. Finally, by Lemma [S] dissimilar states are always separated. From this,
property c) follows. □

5 Complexity 

We now sketch an implementation for the algorithm which consists of two parts. 
The first one deals with the preprocessing work which corresponds to the first 
four program lines. It also verifies whether the algorithm terminates in line 1 
(however this case is trivial and skipped in the following discussion) . The second 
part contains the body of the While-loop. 

We start by studying the data structure used for managing the current
decomposition $\{Q_1, \ldots, Q_r\}$. Each state set $Q_i$ is represented by a doubly linked
list of length $|Q_i|$. A simple implementation for this is to manage an array link of
size $n$, where link[j] contains prev and next indices indicating the states before
and after $q_j$ in its corresponding list. In order to know where each list begins, we
manage an additional array head, again with $n$ elements, where head[i] stores the
index of the first element of $Q_i$ (for $i > r$ the index is undefined). Furthermore,
the function getindex maps $Q$ to $\{1, \ldots, r\}$, where $\mathit{getindex}(q) = j$ iff $[q] = Q_j$.
Another array contains the levels for all states in $Q$ and can be computed with
the breadth-first-search method [4] in $O(n)$ time. This bound also applies to
the other preprocessing work done before reaching the While-loop in line 5.
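
The array-based lists described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the class name `Decomposition` and its method names are hypothetical, states are numbered 0..n-1, set indices start at 1, and the `link` array is split into two arrays `prev` and `next`.

```python
class Decomposition:
    """State sets Q_1..Q_r kept as doubly linked lists over states 0..n-1."""

    def __init__(self, n):
        self.prev = [None] * n          # link[j].prev
        self.next = [None] * n          # link[j].next
        self.head = [None] * (n + 1)    # head[i]: first state of Q_i (1-based)
        self.getindex = [None] * n      # getindex[q] = i iff q is in Q_i

    def insert(self, q, i):
        """Put state q at the front of Q_i in constant time."""
        first = self.head[i]
        self.prev[q], self.next[q] = None, first
        if first is not None:
            self.prev[first] = q
        self.head[i] = q
        self.getindex[q] = i

    def remove(self, q):
        """Unlink state q from its current set in constant time."""
        i, p, nx = self.getindex[q], self.prev[q], self.next[q]
        if p is None:
            self.head[i] = nx
        else:
            self.next[p] = nx
        if nx is not None:
            self.prev[nx] = p
        self.prev[q] = self.next[q] = self.getindex[q] = None

    def members(self, i):
        """List the states of Q_i by walking its linked list."""
        out, q = [], self.head[i]
        while q is not None:
            out.append(q)
            q = self.next[q]
        return out
```

Moving a state from one set to another is then a `remove` followed by an `insert`, both in constant time, which is what the splitting steps discussed later rely on.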

An efficient implementation for the FIFO queue $T$ is required. Whenever a
pair $(S, k_i)$ is appended to $T$, we have $S = Q_i$ at that point of time. Thus $S$ does
not need to be saved if we keep track of all parts which are later separated from
$Q_i$. Thus when $Q_i$ is created, we start to manage a singly linked (and initially
empty) list $\mathit{children}_i$ of indices. Each time a state set $Q_j$ with $j > i$ is separated
from $Q_i$, $j$ is added to $\mathit{children}_i$. Now $T$ can be implemented by a linear size
array $T'$ storing $k_i$ for $2 \le i \le r$, and an index $h$ for $T'$ representing the head
of the queue. Then line 6 becomes $T'[2] := 0$ and $h := 2$, and line 20 becomes
$T'[r + 1] := k + 1$. To extract $(S, k)$ from $T$, we collect the set $C$ of all indices
associated with the tree at root $h$, using the children pointers. Also, we put
$k := T'[h]$ and $h := h + 1$. Then $S = \{q \in Q_j \mid j \in C\}$, and thus extracting
$(S, k)$ takes $O(|S|)$ time. Note that testing the condition in line 5 is equivalent
to verifying $h \le r$.
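
The extraction of a pair from the queue can be sketched as below. The function name `extract` and the dictionary-based representation of the children lists are assumptions made for illustration; `members(j)` stands for any routine that enumerates the current states of the set with index j.

```python
def extract(h, T_prime, children, members):
    """Return (S, k) for queue position h.

    T_prime[i]  : the value k_i stored when Q_i was created
    children[i] : indices j > i of the sets later separated from Q_i
    members(j)  : current states of Q_j
    """
    indices, stack = [], [h]
    while stack:                       # walk the tree rooted at h
        i = stack.pop()
        indices.append(i)
        stack.extend(children.get(i, []))
    S = [q for j in indices for q in members(j)]
    return S, T_prime[h]
```

Since every state set is nonempty, the number of visited indices cannot exceed the number of collected states, so the walk stays within the stated bound for extracting a pair.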

Additional data structures are necessary for implementing the While-loop.
For each $q \in Q$ and $a \in \Sigma$ we require the set $\delta^{-1}(q, a) := \{p \in Q \mid \delta(p, a) = q\}$
to be stored in a singly linked list. There are two additional linear size arrays
count and newindex with count[i] and newindex[i] initialized with zero, for all
$1 \le i \le n$. Finally, we define two more functions $b$ and $c$ with domain $\{1, \ldots, n\}$
by setting $b(i) = |\{q \in Q_i \mid \mathrm{level}(q) \le l\}|$, $c(i) = |Q_i|$ for $i = 1, 2$, and
$b(i) = c(i) = 0$ for $i > 2$. All initial settings can be easily calculated in $O(n)$
time and space (regarding $|\Sigma|$ as a constant).

The meaning of $b$ and $c$ is as follows. While executing the While-loop, it
always holds that $b(i) = |\{q \in Q_i \mid \mathrm{level}(q) \le l - k\}|$ and $c(i) = |Q_i|$, for all
$1 \le i \le r$, where $k$ has been determined in line 6. It is easy to update $c$ whenever
state sets are split, but managing $b$ requires for all $1 \le i \le l$ a singly linked list
$R_i$ containing the set $\{q \mid \mathrm{level}(q) = l - i\}$. Again, these lists can be determined
in linear time. Now each time $k$ increases by one due to line 6 (see the proof
of Lemma a), we in turn decrease $b(\mathit{getindex}(q))$ by one for all $q \in R_k$. Then
clearly $b$ is correctly managed. Since each list $R_i$ is only required once, the overall
additional time complexity added to the loop is linear.

Lines 8-23 are processed in three steps. Using count for temporary data, the
first step computes $|Q_i \cap X|$ and $|Q_i \cap Y|$ for all $1 \le i \le r$, where $X$ and $Y$
are as specified in lines 8 and 9, respectively. The states are split during the second step,
using newindex to hold the indices of the new states. The last step exchanges
the positions of some state sets and cleans up the auxiliary data structures.

The first step is as follows. For each $q \in S$ and $p \in \delta^{-1}(q, a)$, provided that
$\mathrm{level}(p) \le l - k$, we increase $\mathit{count}[\mathit{getindex}(p)]$ by one. Since the automaton
is deterministic, each state $p$ occurs at most once. Afterwards we clearly have
$|Q_i \cap X| = \mathit{count}[i]$ and $|Q_i \cap Y| = b(i) - \mathit{count}[i]$ for all $1 \le i \le r$.
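
A minimal sketch of this counting step is given below. The function name `count_step` and the dictionary `inverse` (mapping a pair (q, a) to the list of predecessors of q under a) are hypothetical. The level test is written with "<=", which is an assumption about a relation symbol that is damaged in the extracted text.

```python
def count_step(S, a, inverse, level, getindex, l, k, r):
    """Return count with count[i] = |Q_i ∩ X| for 1 <= i <= r."""
    count = [0] * (r + 1)              # index 0 is unused
    for q in S:
        for p in inverse.get((q, a), []):
            if level[p] <= l - k:      # only low-level predecessors count
                count[getindex[p]] += 1
    return count
```

Because the automaton is deterministic, each predecessor p is visited at most once per symbol, so the work is proportional to the total size of the inverse lists of the states in S.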

To separate the states, we always choose $Z := Q_i \cap X$, $Q_{r+1} := Z$, and
$Q_i := Q_i \setminus Z$ as in lines 12, 14, and 15. (Later, $Q_i$ and $Q_{r+1}$ are exchanged if
the condition in line 13 is not satisfied.) For each $q \in S$ and $p \in \delta^{-1}(q, a)$ with
$\mathrm{level}(p) \le l - k$, we put $i := \mathit{getindex}(p)$ and test the two conditions $\mathit{count}[i] > 0$
and $b(i) - \mathit{count}[i] > 0$. If both hold then $p$ must be moved from $Q_i$ to $Q_s$, where
$s = \mathit{newindex}[i]$. If $s = 0$ then the new state set has not yet been created. We
then assign $r := r + 1$, $\mathit{newindex}[i] := r$, and set up $Q_r$ to contain only $p$. If $s > 0$,
we directly insert $p$ into $Q_s$. Note that each case can be processed in constant
time.

During the third step, for all $q \in S$ and $p \in \delta^{-1}(q, a)$, we assign $\mathit{count}[i] := 0$
and $\mathit{newindex}[i] := 0$, where $i = \mathit{getindex}(p)$. But before doing so, if $s > 0$ (with
$s = \mathit{newindex}[i]$) we verify whether $c(i) < c(s)$ holds, indicating that $Q_i$ and
$Q_s$ must be exchanged. Since $\mathit{count}[i] = c(s)$, exchanging $Q_i$ and $Q_s$ can be
accomplished in $O(|Q_i| + |Q_s|) = O(c(i) + c(s)) = O(2 \cdot c(s)) = O(\mathit{count}[i])$ time,
mainly for updating the corresponding getindex entries.

Note that resetting $\mathit{newindex}[i]$ to zero prevents $Q_i$ and $Q_s$ from being exchanged
twice. So the consumed time for exchanging all state sets cannot exceed
$O(\sum_{1 \le i \le n} \mathit{count}[i]) = O(\sum_{q \in S} |\delta^{-1}(q, a)|)$. This bound also applies to the other
work done during all three steps, and thus to the complete body of the loop
(lines 6-24), regarding $|\Sigma|$ as a constant. Moreover, the automaton is deterministic,
thus for all $c \in \Sigma$ we have $\delta^{-1}(p, c) \cap \delta^{-1}(q, c) = \emptyset$ if $p \ne q$, and hence
$\sum_{q \in Q} |\delta^{-1}(q, c)| \le n$. Finally, when splitting a state set, only the smaller part
is appended to $T$. Thus for all $q \in Q$ a state set $S$ containing $q$ can be extracted
from $T$ at most $O(\log n)$ times. Altogether this yields the following result.
Theorem 3. An SSD can be determined in $O(n \log n)$ time and $O(n)$ space.

Corollary 1. An n-state DFCA can be minimized within the same bounds.

Proof. We can easily accomplish the construction in Thm. [T] in linear time. □

Corollary 2. An n-state DFA accepting a finite language can be converted into
a minimal DFCA using $O(n \log n)$ time and linear space.

Proof. The DFA is also a DFCA. □

Thus the new algorithm significantly improves the previously known methods
for converting and minimizing cover automata. Also, we assume that the
$O(n \log n)$ time bound is tight, and state this conjecture as an open problem.
Abstract. With the aim of removing the residual errors made by pure
stochastic disambiguation models, we put forward a hybrid system in
which linguist users introduce high-level contextual rules to be applied
in combination with a tagger based on a Hidden Markov Model. The
design of these rules is inspired by the Constraint Grammars formalism.
In the present work, we review this formalism in order to propose a
more intuitive syntax and semantics for rules, and we develop a strategy
to compile the rules in the form of Finite State Transducers, thus
guaranteeing an efficient execution framework.



1 Introduction 

In the context of the use of the tools for part-of-speech tagging (POST) developed
in the Galena and CORGA projects^1, our projects for the automatic processing
of the Spanish and Galician languages, a repeated request from our linguist
partners has been the design of a formalism to introduce high-level contextual rules.
The purpose of these rules is to remove the residual errors that pure stochastic
taggers systematically make, allowing them to improve on their usual performance
of 95-97% success.

The constraint grammars formalism (CGs) [S] was a good candidate on the
basis of its good results: for specific languages such as English, performance
can reach 99% precision with a set of about 1 000 contextual rules; and,
in general, performance is better than that of pure stochastic taggers,
particularly when training and application texts are not from the same style and
* This work has been partially supported by the Spanish Government (under projects 
TIG2000-0370-C02-01 and HP2001-0044), and by the Galician Government (under 
project PGIDT01PXI10506PN). 

^1 Galena means Generation of Natural Language Analyzers and Corga means
Reference Corpus of Current Galician. See http://coleweb.dc.fi.udc.es for more
information on both projects.
source. However, comparison is difficult since some ambiguities are not resolved
by CGs. That is, CGs return a set of more than one tag in some cases, which
makes it necessary to combine this technique with another one.

Furthermore, the syntax and semantics of the rules involved in CGs are
not very intuitive, since they try to solve problems other than tagging. In fact,
they join tagging and certain steps of parsing, even though both analyses are
traditionally treated in separate modules. These aspects ultimately led us to
design a new formalism for contextual rules. Furthermore, these rules will be
oriented to operate together with our tagger, a second order hidden Markov
model (HMM) with linear interpolation of uni-, bi-, and trigrams as smoothing
technique Pj.

The aim of the present work is to describe the new formalism of contextual
rules. This formalism is mainly inspired by CGs, but some aspects from other
rule-based environments, such as transformation-based error-driven learning [I]
or relaxation labelling [S], have also been considered. After this, we focus the
discussion on the efficient execution of the rules, for which we design a strategy
that compiles them into finite state transducers (FSTs). Finally, we make some
observations about the time and space complexity of the FSTs obtained.

2 Contextual Rules Based on Constraints 

The heterogeneous nature of languages for building contextual rules is a current 
problem: there is no standard, and there are many differences between languages. 
Basically, all systems use rules of this kind: if a certain constraint or condition 
is satisfied (for instance, the word following the current word is a verb), then a 
concrete action is executed (for instance, to select the tag substantive). Actions 
are usually limited to selection or deletion of one of the possible tags. However, 
constraints involved in the rules can vary greatly in the type of the condition 
and in the syntax used. 

Our formalism omits syntactic components from the rules, represents the 
same range of conditions as that covered by the rule-based environments cited 
above, but with a more intuitive and legible syntax, and allows us to express 
preceding and following contexts on the basis of both tags and words, whether 
those words are ambiguous or not. The rules can be classified in three different 
groups: local rules, barrier rules and special rules. 

2.1 Local Rules 

The rules in this group have the following syntax:

<action> (<tag>) ([not] <position> <condition> {<values>}, ...);

In <action>, we specify the mission of the rule. The possible actions to execute
are:

— select, to select one of the possible tags of the current word,

— delete, to remove one of those tags,

— and force, to fix the tag for the current word, even though this tag is not
present in the set of possible tags for that word.

The field <position> must be replaced by an integer indicating which is the 
word affected by the condition of the rule: 0 for the current word, -1 for the 
previous one, 1 for the next one, and so on. The possible values for <condition> 
and their corresponding semantics are: 

— is, which is true when the set of possible tags of the word affected by the 
condition and the set of tags <values> specified in the rule are equal, 

— contains, which is true when the set of possible tags of the word affected 
by the condition contains the set of tags <values> specified in the rule, 

— and belongs, which is true when the word affected by the condition is present 
in the set of words <values> specified in the rule. 

Conditions can be combined with the usual logic connectives: 

— for negation, we can use the keyword not in front of the condition, 

— for logic-and, we can write several conditions delimited by commas in the 
same rule, 

— and for logic-or, we can write the conditions in separate rules, all these rules 
involving the same <action> and <tag> fields. 

The following example illustrates the form of the rules included in
this group:^2

force (Det) (1 belongs {sobre}, not 1 is {P});

This rule fixes determiner as the tag of the current word, when the next one is 
the word sobre and is not a preposition. 

2.2 Barrier Rules 

These rules do not refer to a concrete position, but navigate leftwards or
rightwards from the current word, by replacing the field <direction> by -* or +*,
respectively, in the following syntax:

<action> (<tag>) (<direction> <condition-1> {<values-1>}
                  barrier <condition-2> {<values-2>}, ...);

The general constraint of the rule is established by the condition <condition-1>
expressed before the keyword barrier, and the boundary for navigation is
established by <condition-2>. This allows us to express situations like the following:

select (Adj) (-* contains {S} barrier is {V});

where if before the current word there is another word that can be a substantive, 
and between those two words there is no verb, the current word will be an 
adjective. 
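
The navigation performed by a barrier rule can be sketched as follows. This is only an illustration under assumptions: the function name `barrier_matches`, the representation of a trellis as a list of (word, tag set) pairs, and the sample words in the usage below are hypothetical. The condition is tested before the barrier at each word, on the reading that the barrier applies only to the words strictly between the current word and the matching one.

```python
def barrier_matches(trellis, pos, direction, cond, barrier):
    """True if a word satisfying cond is reached from pos before any barrier.

    trellis   : list of (word, set_of_tags) pairs
    direction : "-*" to navigate leftwards, "+*" to navigate rightwards
    cond      : function (word, tags) -> bool, the general constraint
    barrier   : function (word, tags) -> bool, the boundary for navigation
    """
    step = -1 if direction == "-*" else 1
    i = pos + step
    while 0 <= i < len(trellis):
        word, tags = trellis[i]
        if cond(word, tags):
            return True
        if barrier(word, tags):
            return False
        i += step
    return False


def contains_S(word, tags):
    return {"S"} <= tags


def is_V(word, tags):
    return tags == {"V"}
```

For the rule select (Adj) (-* contains {S} barrier is {V}), calling `barrier_matches(trellis, pos, "-*", contains_S, is_V)` decides whether the action applies to the word at position `pos`.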

^2 To simplify, in this work, we use Adj for adjective, Det for determiner, P for
preposition, Pron for pronoun, S for substantive, and V for verb.
2.3 Special Rules

Finally, we have a group for special rules, that always execute an action. Their 
syntax is: 

<action> always (<tag>) ; 

In this case, the possible actions are only select and delete, because the action 
force would produce an output text with all the words tagged with the same 
tag <tag>. An example could be the following: 

delete always (ForeignWord) ; 

When it is sure that the input text to be processed is written only in the local 
language, we can use this rule to remove the tag ForeignWord from all ambiguous 
words that can be a foreign word and something else. These rules could be called 
also rare rules, since they are not commonly used. 



2.4 A Practical Example 

In Table 1 we show how an initial trellis with ambiguous words evolves step by
step when a certain set of contextual rules is applied on it. We have marked with
boxes the points where individual rules match. The last row contains the final
trellis obtained after the application of the whole set of rules.



Table 1. Evolution of a trellis when a set of contextual rules is applied on it

Rule                                        Trellis

select (S)                                  El    sobre  esta  sobre  la    mesa
  (1 is {V});                               Pron  P      V     P      Det   S
                                                  [S]          S      Pron  V
                                                  V            V

delete (V)                                  El    sobre  esta  sobre  la    mesa
  (-2 contains {P,V});                      Pron  S      V     P      Det   S
                                                               S      Pron  [V]
                                                               V

select (P)                                  El    sobre  esta  sobre  la    mesa
  (-2 belongs {bajo, sobre}, -1 is {V});    Pron  S      V     [P]    Det   S
                                                               S      Pron
                                                               V

force (Det)                                 El    sobre  esta  sobre  la    mesa
  (1 belongs {sobre}, not 1 is {P});        Pron  S      V     P      Det   S
                                                                      Pron

(final trellis)                             El    sobre  esta  sobre  la    mesa
                                            Det   S      V     P      Det   S
                                                                      Pron

(Square brackets stand for the boxes of the original table.)
The correct sense of this sentence is the following: The envelope is on the
table. Therefore, we can observe in the final trellis that some tagging conflicts
have been solved. However, the trellis still contains ambiguities. This behaviour is
normal, since the application of contextual rules does not guarantee the removal
of all ambiguities; but now we could apply the Viterbi algorithm with a greater
chance of success. It is necessary to remember that we are in the context of a
hybrid system for POST, and rules operate together with an HMM-based tagger.
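
The evolution of the trellis in this example can be reproduced with a small interpreter for local rules. The sketch below is not the authors' implementation: the function names, the tuple encoding of conditions, and two points of semantics are assumptions (a condition that refers to a position outside the sentence is taken to fail, and all conditions of a rule are evaluated on the trellis as it was before that rule was applied).

```python
def holds(trellis, pos, cond):
    """cond = (negated, offset, kind, values); out-of-range positions fail."""
    negated, offset, kind, values = cond
    i = pos + offset
    if not 0 <= i < len(trellis):
        return False
    word, tags = trellis[i]
    if kind == "is":
        result = tags == values
    elif kind == "contains":
        result = values <= tags
    else:                              # "belongs"
        result = word in values
    return result != negated


def apply_rule(trellis, action, tag, conds):
    """Return a new trellis after applying one local rule at every position."""
    out = []
    for pos, (word, tags) in enumerate(trellis):
        applicable = action == "force" or tag in tags
        if applicable and all(holds(trellis, pos, c) for c in conds):
            if action == "delete":
                tags = tags - {tag}
            else:                      # select or force
                tags = {tag}
        out.append((word, tags))
    return out


trellis = [("El", {"Pron"}), ("sobre", {"P", "S", "V"}), ("esta", {"V"}),
           ("sobre", {"P", "S", "V"}), ("la", {"Det", "Pron"}),
           ("mesa", {"S", "V"})]
rules = [
    ("select", "S", [(False, 1, "is", {"V"})]),
    ("delete", "V", [(False, -2, "contains", {"P", "V"})]),
    ("select", "P", [(False, -2, "belongs", {"bajo", "sobre"}),
                     (False, -1, "is", {"V"})]),
    ("force", "Det", [(False, 1, "belongs", {"sobre"}),
                      (True, 1, "is", {"P"})]),
]
for action, tag, conds in rules:
    trellis = apply_rule(trellis, action, tag, conds)
```

After the four rules, only the word la remains ambiguous (Det or Pron), which agrees with the final trellis of the example.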

3 Compilation of Contextual Rules into FSTs 

Instead of processing the input stream forward and backward repeatedly, looking 
for the matching of the conditions involved in the rules, it is advisable to perform 
a previous compilation step that translates the rules into finite state transducers. 
These structures always treat the input stream linearly forward and allow us to 
apply contextual rules more efficiently. 



3.1 Definition of the Input Alphabet 

In order to work with FSTs, we need an alphabet, i.e. the set of symbols that
make up the input streams. Since rules will be applied on trellises, in which
ambiguous words can appear, a first alternative was to use pairs as elements of the
alphabet. In these pairs, the first component would be a word, and the second one
would be the set of tags associated with that word in the trellis. However, moving
between the states of the FSTs would then require operations on sets, and those
operations could involve too high a computational effort for every single
transition. Therefore, this approach was rejected.

A better mechanism for the symbolic compilation of the rules is the assignment
of integers to words and tags. Furthermore, in order to obtain FSTs which
are as compact as possible, we do not consider the whole tagset and the whole
dictionary of our system, but only the set of tags and the set of words that
appear in the rules (we will call these sets TR and WR, respectively). In this way,
the exact correspondence is the one shown in Table 2: 0 is reserved for the empty
string ε; -1 and 1 will be the beginning and the end of sentence, respectively; the
tags that appear in the rules will be numbered forward from 3 to n, and stored
in a structure that we will call mini-tagset; the words that appear in the rules
will be numbered backward from -3 to -m, and stored in a structure that we
will call mini-dictionary; and finally, -2 and 2 will be reserved as wildcards to
represent any other words or tags, respectively, that do not appear in the rules,
but can appear in a new sentence to be tagged and in its corresponding trellis.



Table 2. Mapping for words and tags that can appear in trellises

w ∈ WR             w ∉ WR   Start   ε    End   t ∉ TR   t ∈ TR
-m, ..., -4, -3    -2       -1      0    1     2        3, 4, ..., n
Table 3. A mini-tagset and a mini-dictionary

Mini-tagset      Mini-dictionary
Det → 3          bajo  → -3
P   → 4          sobre → -4
S   → 5
V   → 6


Table 4. A trellis and its corresponding mapping

El    sobre  esta  sobre  la    mesa        -2   -4   -2   -4   -2   -2
Pron  P      V     P      Det   S            2    4    6    4    2    5
      S            S      Pron  V                 5         5    3    6
      V            V                              6         6


The set of rules used in the example of Table 1 involves four tags and two
words, so we can build the mini-tagset and mini-dictionary shown in Table 3.
By using these structures and the mapping explained above, we can transform
the initial trellis of that example into the one shown in Table 4.

Now, we collect all the integers of this new trellis by columns and add the 
marks of beginning and end of sentence in order to obtain the corresponding 
input stream for a cascade of FSTs implementing the contextual rules: 

-1 -2 2 -4 4 5 6 -2 6 -4 4 5 6 -2 2 3 -2 5 6 1 

The FSTs will translate this input stream into an output stream with the same 
aspect. After this, we apply the inverse procedure to retrieve the final trellis 
obtained from the application of the contextual rules. 
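
The encoding of a trellis as an input stream can be sketched as follows. The names `MINI_TAGSET`, `MINI_DICTIONARY` and `encode` are hypothetical, and one point is an assumption: several tags of the same word that fall outside the mini-tagset are collapsed into a single wildcard 2. The tags of each word are emitted in increasing order of their integers.

```python
MINI_TAGSET = {"Det": 3, "P": 4, "S": 5, "V": 6}
MINI_DICTIONARY = {"bajo": -3, "sobre": -4}


def encode(trellis):
    """Flatten a trellis, column by column, into a stream of integers."""
    stream = [-1]                                      # beginning of sentence
    for word, tags in trellis:
        stream.append(MINI_DICTIONARY.get(word, -2))   # -2: any other word
        codes = {MINI_TAGSET.get(t, 2) for t in tags}  # 2: any other tag
        stream.extend(sorted(codes))
    stream.append(1)                                   # end of sentence
    return stream


initial = [("El", {"Pron"}), ("sobre", {"P", "S", "V"}), ("esta", {"V"}),
           ("sobre", {"P", "S", "V"}), ("la", {"Det", "Pron"}),
           ("mesa", {"S", "V"})]
stream = encode(initial)
```

Applied to the initial trellis of the example, the sketch yields the input stream given above.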



3.2 Building an FST for Every Contextual Rule 

The next step consists in the construction of an individual FST for every con- 
textual rule to be applied. Firstly, we describe this process intuitively by using 
the following rule: 

select (S) (1 is {V}) ; 

A possible version of the corresponding FST is shown in Fig. 1. In this case, the
action must be represented first (from state 0 to state 2), since the condition 
(from state 2 to 5) affects a word that appears afterwards. In both cases, action 
and condition, the word must be represented before its possible tags. We will 
draw arcs with thick lines for words and arcs with normal lines for tags. 

Fig. 1. Fictitious FST for the rule select (S) (1 is {V});
[Figure not reproduced. Labels recoverable from it: -2:-2, bajo:bajo and sobre:sobre on the arcs for words; 2:0, Det:0, P:0 and S:S on arcs for tags; the two parts of the FST are marked "action" and "condition".]

— Action. In this rule, the current word can be any word, hence we use the
wildcard -2 and all the words present in the mini-dictionary (bajo and sobre)
in the label of the arc from state 0 to 1. In a trellis, only tags can be
modified, not words. This is the reason why these three possible cases for words
are translated into themselves. The action of the rule, i.e. to select the tag
substantive, is represented by the arc 1 → 2, which does not modify the
tag, and by the loops in states 1 and 2, which remove the rest of possible
tags by translating them into the empty string 0. We use two loops instead
of only one because we assume that the tags will always appear in a
predefined order (usually, the lexicographic order, which should coincide with the
order defined in the mapping), and that this order will not be broken by any
sequence of iterations with different tags in the same loop. These hypotheses
are not critical in practice, and allow us to build more compact FSTs. Be
that as it may, the most important aspect is that an arc keeping the tag
substantive is mandatory and cannot be included in any optional loop.

— Condition. The same reasoning made above is applicable to the word
involved in the condition (the arc from state 2 to 3). The rule requires this
word to be precisely a verb, hence the presence of only one arc for its tags
(3 → 4, which does not modify the tag since we are implementing a
condition, not an action). This is enough for the example under consideration,
since verb is the last tag in the mini-tagset. However, it is a better alternative
not to allow state 4 to be the final state of this FST, because in general it
cannot be guaranteed that verb is the unique tag. Therefore, another word
or the end of sentence must appear after the tag (hence the final arc from
state 4 to 5).

The FST presented above is fictitious and has been shown only to facilitate 
understanding. In practice, words and tags must be replaced by their integers 
associated in the mapping, producing the corresponding real implementation of 
the FST shown in Fig. 2.

The designing and building of FSTs becomes more difficult when the condi- 
tion is complex. For instance, the construction of transducers for barrier rules or 
for local rules with inclusions or negations is much more complicated than for the 
simple equality shown above. Nevertheless, this process of symbolic compilation 
has been implemented for all the rule schemes covered by our formalism. Since 
it is not possible to describe the whole compilation process here, we include only 
one more example in order to illustrate how FSTs always prove to be a robust 
frame to perform any operation involved in this kind of contextual rules. In this 
case, we consider the generic barrier rule 

select (t_s) (+* is {t_j, t_k, ..., t_l} barrier is {t_x, t_y, ..., t_z});

The abstract version of the corresponding FST is shown in Fig. 3.

Fig. 2. Real FST for the rule select (S) (1 is {V});

Fig. 3. Generic FST for the barrier rule select (t_s) (+* is {t_j, t_k, ..., t_l} barrier is
{t_x, t_y, ..., t_z});

3.3 Making the FSTs Operative 

Let us consider, for instance, a simple transducer which transforms 3 into 4 when 
1 appears before the 3. This FST will produce the output stream 1 4 only if the 
input stream is exactly 1 3. However, we want this FST to produce the same 
effect for any other sequences of symbols containing 1 3 as substring. In order to 
do this, we need to apply a normalization process on the FST. That is, once we 
have built a transducer for every contextual rule, the next phase is to normalize
all these FSTs. The normalization process requires the following steps: 

— Compilation and closure of the alphabet. Once we have defined the input
alphabet, which we will call Σ, it is necessary to create an FST that transforms
every symbol into itself. Then, we obtain Σ*, where * is the closure operation,
in order to be able to process input strings with more than one symbol.

— For every transducer T, we perform the following operations:

• a = p1(T), where p1 is the first projection operation, i.e. every arc
q --x:y--> r is replaced by q --x--> r or by q --x:x--> r. Therefore, a can be
seen as an automaton or as a transducer that does not change its input.

• Γ = Σ* − (Σ* · a · Σ*), where − is the subtraction operation and · is the
concatenation operation. In this way, we obtain in Γ a transducer that
accepts exactly the strings containing no substring accepted by a (and by T).

• Finally, N = Γ · (T · Γ)*, in order to obtain in N the normalized version
of the transducer T.
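
The behaviour expected from a normalized transducer can be illustrated on the simple transducer mentioned earlier, which turns 3 into 4 when 1 appears before it. The sketch below does not use transducer algebra or any FST library: it emulates, by a direct scan, the input-output behaviour that N = Γ · (T · Γ)* is meant to have for a rule matching a fixed sequence. The function name `apply_everywhere` is hypothetical, and matching leftmost occurrences without overlap is an assumption.

```python
def apply_everywhere(stream, pattern, replacement):
    """Rewrite every occurrence of pattern; copy everything else unchanged.

    The copied stretches play the role of Γ (no match inside them),
    the rewritten occurrences play the role of T.
    """
    out, i = [], 0
    while i < len(stream):
        if stream[i:i + len(pattern)] == pattern:
            out.extend(replacement)    # a stretch handled by T
            i += len(pattern)
        else:
            out.append(stream[i])      # a symbol copied as part of Γ
            i += 1
    return out


result = apply_everywhere([2, 1, 3, 5, 1, 3], [1, 3], [1, 4])
```

The unnormalized transducer would only accept the exact input 1 3, whereas the emulated normalized behaviour rewrites both occurrences inside the longer stream and leaves a stream without any occurrence untouched.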

Once we have normalized all transducers, we can form a cascade by applying 
them in sequence, taking the output of an FST as input for the following one. 
Another option is to build only one FST able to simulate the simultaneous 
application of all contextual rules by only one pass over the input stream. This 
FST can be obtained by composing all individual transducers. 

The operations of closure, projection, subtraction, concatenation, composition,
etc., are well-defined in the theory of FSTs, and there are tools available
which implement all these procedures. In our case, we have used the free
version of FSM Library: General-purpose finite-state machine software tools,
available at http://www.research.att.com/sw/tools/fsm/, implemented by
Mohri, Pereira and Riley from the AT&T laboratories [3]. This library has
allowed us to build the final operative version of our FSTs.



3.4 Space and Time Complexities 

The FSTs obtained have always operated correctly and rapidly, which shows that
the general compilation procedure is robust. For instance, the space and time
consumed by the FSTs corresponding to the rules of Table 1 appear in Table 5,
where Ti is the FST of each individual rule, and T1234 is the composition of the
four preceding FSTs. It is important to note that, in theory, the time complexity
of the FST obtained by composition is linear with respect to the length of the input,
and does not depend on the number of rules. But in practice, even though in a
hybrid system for POST a small set of contextual rules (between 50 and 100) is
expected to be sufficient, the computational cost of the composition operation is
very high, and it is more advisable to perform a sequential application of the rules
by executing the cascade of the corresponding individual FSTs: for time, 0.247
vs. 1.717 seconds, on a 450 MHz Pentium III under the Linux operating system; and
for space, 20 160 vs. 2 931 156 bytes, in all cases with only 4 contextual rules.
Table 5. Spaces and times for a set of FSTs and for their corresponding composition

         Initial FST                     Normalized FST
FSTs     number     number   size        number     number    size         compilation  execution
         of states  of arcs  in bytes    of states  of arcs   in bytes     in seconds   in seconds
T1            6        16       348          19        142       2 520       0.245        0.051
T2           11        32       664          58        520       9 036       0.878        0.072
T3            8        26       532          39        346       6 024       0.586        0.070
T4            7        23       472          20        145       2 580       0.251        0.054
Total        32        97     2 016         136      1 153      20 160       1.960        0.247
T1234         -         -         -      30 244    160 513   2 931 156      11.773        1.717



4 Conclusion and Future Work 

We have presented a strategy to compile constraint-based contextual rules for
part-of-speech tagging in the form of finite state transducers. The purpose
of these contextual rules is to remove the residual errors that pure stochastic
taggers make. These contextual rules must be introduced by linguists, and
therefore we also provide a new formalism with intuitive syntax and semantics.
Transducers have proved to be a formal and robust execution framework
for contextual rules, but there are still some aspects that should be investigated
further. In the context of our hybrid system for POST, where contextual rules
operate in combination with an HMM-based tagger, it is necessary to perform
experiments with the purpose of selecting the best order in which to apply these
two disambiguation techniques. Another aspect of future work, in this case for
situations where it is possible to detect that residual errors are systematic, is
the automatic generation of specific contextual rules for these errors.
Abstract. Finite state networks can represent dictionaries and lexical
relations. Traditional finite-state operations like composition can produce
huge networks with prohibitive computation space and time. For a
subset of finite state operations, these drawbacks can be avoided by using
virtual networks, which rely on structures that are partially built on
demand. This paper addresses the implementation of virtual network
operations in xfst (XEROX Finite State Technology software). The example
of “priority union”, which is particularly useful in NLP, is developed.



1 Introduction 

Finite-state techniques are widely used in various areas of Natural Language
Processing (NLP) such as morphological analysis, tokenization [T^ and
part-of-speech disambiguation. Finite-state transducers, which are a generalization
of finite-state automata, can be used to efficiently represent lexical relations.
Such “lexical transducers” have many advantages: they take up little memory,
are robust, and are available as complete products for commercial applications.

In addition to memory considerations, existing finite-state automata can be 
powerfully and flexibly combined together into new automata using finite-state 
operations such as, for example, union, priority union, intersection and compo- 
sition. 

Because of memory and time considerations, the full computation of such 
new automata may be intractable. Lazy finite-state algorithms make it possible 
to perform these operations partially, building them on demand at runtime as 
required to handle specific inputs. Lazy algorithms have already been designed 
with success to handle operations on automata such as determinization (see 
IPP for an application to pattern matching). Our aim here is to implement the 
so-called “virtual networks” and their algorithms based on lazy evaluation. 

In the next section, we will introduce finite-state transducers (FSTs) by 
giving some definitions and computational considerations. In the third 
section, we will explain the advantages of xfst “virtual networks” [7] and 
how they work. The priority union will be taken as an example. 
2 Transducers 

A Finite-State Transducer (FST) [12] is a machine that encodes a relation 
between two regular languages. It behaves like a FSA (Finite-State Automaton), 
also called an acceptor. For each accepted word of the input language, a FST 
returns one or more related words of the output language. 

2.1 Definitions and Operations 

Definitions. A FST is a 6-tuple $(\Sigma, \Omega, Q, I, F, E)$ where: $\Sigma$ 
is the input alphabet, $\Omega$ is the output alphabet, $Q$ is the (finite) set 
of states, $I \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the 
set of final states, and 
$E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Omega \cup \{\epsilon\}) \times Q$ 
is a finite set of transitions. 

As the input side of a FST can be seen as a FSA, we similarly define a 
transition function $\delta$ mapping $Q \times (\Sigma \cup \{\epsilon\})$ to 
$2^Q$ by $\delta(q, a) = \{q' \mid \exists (q, a, b, q') \in E\}$. It can be 
extended to input words in a mapping $\delta^*$ from $Q \times \Sigma^*$ to $2^Q$. 

Given a FST $T = (\Sigma, \Omega, Q, I, F, E)$, a path from $p_1$ to $q_n$ in 
$T$ is a sequence $(p_i, a_i, b_i, q_i)_{i=1,n}$ of edges of $E$ such that 
$q_i = p_{i+1}$, $i = 1 \ldots n-1$. A successful path is a path from an 
initial state to a final state. A word $w$ is recognized by a transducer if and 
only if $\exists i \in I \mid \delta^*(i, w) \cap F \neq \emptyset$. 

Transducers are only able to express a class of relations: the regular ones 
(the link between finite-state transducers and regular relations can be shown 
in analogy to Kleene’s theorem). The construction of transducers from regular 
relations can be performed in the same way as for automata from regular 
expressions [2]. Empty transitions ($\epsilon$-transitions) are simply replaced 
by $(\epsilon, \epsilon)$. In this way, we can define union, concatenation and 
closure of transducers. 

We will use the term finite-state network to cover both simple automata and 
transducers. By convention, our networks have one initial state, shown as the 
leftmost state in our diagrams, and we can apply them in both directions: from 
the upper side to the lower side and vice versa. The final state is represented 
as a double circle. We use upper side for the input side and lower side for the 
output side. Notice that the following definitions of regular operations could 
be written without using $\epsilon$-transitions, but we use them here to 
produce simpler constructions. 

Fig. 1. A path in a lexicon transducer for English with “Lexical side” as upper 
side and “Surface side” as lower side. “+Adj” and “+Comp” are atomic symbols. 
Fig. 2. The two networks (network A and network B) used in our examples. 

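Union. Given two FSTs $T_1 = (\Sigma_1, \Omega_1, Q_1, i_1, F_1, E_1)$ and 
$T_2 = (\Sigma_2, \Omega_2, Q_2, i_2, F_2, E_2)$, encoding the relations $R_1$ 
and $R_2$, respectively, the relation $R_3 = R_1 \cup R_2$ is encoded by the FST 

$$T_3 = T_1 \cup T_2 \qquad (1)$$ 

such that 
$T_3 = (\Sigma_1 \cup \Sigma_2, \Omega_1 \cup \Omega_2, Q_1 \cup Q_2 \cup \{i_3\}, \{i_3\}, F_1 \cup F_2, E_3)$, 
with 
$E_3 = E_1 \cup E_2 \cup \{(i_3, \epsilon, \epsilon, i_1), (i_3, \epsilon, \epsilon, i_2)\}$. 

Fig. 3. Union of networks A and B; $\epsilon$ is the identity relation 
$\epsilon : \epsilon$. 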



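Concatenation. Given two FSTs $T_1 = (\Sigma_1, \Omega_1, Q_1, i_1, F_1, E_1)$ 
and $T_2 = (\Sigma_2, \Omega_2, Q_2, i_2, F_2, E_2)$, encoding the relations 
$R_1$ and $R_2$, respectively, the relation $R_3 = R_1 \cdot R_2$ is encoded by 
the FST 

$$T_3 = T_1 \cdot T_2 \qquad (2)$$ 

such that 
$T_3 = (\Sigma_1 \cup \Sigma_2, \Omega_1 \cup \Omega_2, Q_1 \cup Q_2, \{i_1\}, F_2, E_3)$, 
with 
$E_3 = E_1 \cup E_2 \cup \{(f_1, \epsilon, \epsilon, i_2) \mid f_1 \in F_1\}$. 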
Kleene Star. Given a FST $T = (\Sigma, \Omega, Q, \{i\}, F, E)$, encoding the 
relation $R$, the relation $R^* = \bigcup_{k \in \mathbb{N}} R^k$ is recognized 
by the FST 

$$T^* = \bigcup_{k \in \mathbb{N}} T^k \qquad (3)$$ 

such that $T^* = (\Sigma, \Omega, Q \cup \{i_1\}, \{i_1\}, \{i\}, E_1)$ and 
$E_1 = E \cup \{(i_1, \epsilon, \epsilon, i)\} \cup \{(f, \epsilon, \epsilon, i) \mid f \in F\}$. 

Fig. 4. The Kleene star of the network A. 



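Composition. Given two FSTs $T_1 = (\Sigma, \Omega, Q_1, \{i_1\}, F_1, E_1)$ 
and $T_2 = (\Omega, \Gamma, Q_2, \{i_2\}, F_2, E_2)$, encoding the relations 
$R_1$ and $R_2$, respectively, the relation $R_3 = R_1 \circ R_2$ is encoded by 
the FST 

$$T_3 = T_1 \circ T_2 \qquad (4)$$ 

such that 
$T_3 = (\Sigma, \Gamma, Q_1 \times Q_2, \{i_1\} \times \{i_2\}, F_1 \times F_2, E_3)$, 
with 
$E_3 = \{((q_1, q_2), a, b, (q_1', q_2')) \mid \exists c \in \Omega \cup \{\epsilon\} \mid (q_1, a, c, q_1') \in E_1 \text{ and } (q_2, c, b, q_2') \in E_2\}$. 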




Fig. 5. Result of “network A composed with network B” . 
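
The edge set of Eq. (4) can be transcribed almost literally. The following is 
a minimal sketch in Python, not the xfst implementation: the tuple layout 
`(source, upper, lower, target)`, the empty string standing for the empty 
symbol, and the name `compose_edges` are assumptions made for this 
illustration. Like the formula it transcribes, it only pairs edges whose middle 
symbols are equal and gives no special treatment to one-sided epsilons. 

```python
EPS = ""  # assumed encoding of the empty symbol (hypothetical convention)


def compose_edges(e1, e2):
    """Build E3 as in Eq. (4): pair edges whose middle symbols agree.

    Each edge is a tuple (source, upper, lower, target).
    """
    e3 = set()
    for (q1, a, c1, q1_next) in e1:
        for (q2, c2, b, q2_next) in e2:
            if c1 == c2:  # the shared symbol c, possibly EPS on both sides
                e3.add(((q1, q2), a, b, (q1_next, q2_next)))
    return e3


# Toy networks: T1 maps "a" to "b", T2 maps "b" to "c".
E1 = {(0, "a", "b", 1)}
E2 = {(0, "b", "c", 1)}
E3 = compose_edges(E1, E2)
```

The states of the result are pairs of original states, which is where the 
product bound on the number of states comes from. 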



2.2 Computational Considerations 

A very common operation on FSTs in NLP is the composition of several trans- 
ducers. Let us consider it to examine some computational problems. 

The simplest way to perform the operation is to simulate the composition by 
applying a string or a set of strings on the input of the first transducer, retrieving 
on its output the related string set and applying that string set to the input of 
the second network, and so on for each stage of the composition. But at each 
level of the composition, the strings of the output string set that are accepted as 
input by the next transducer belong to the intersection of the output language 
of the transducer with the input language of the subsequent transducer. That 
is, in the general case, multiple strings may be generated at each level of the 
composition, but some or all of them may be rejected or filtered out by the next 
transducer. A more efficient way to perform composition of several transducers is 
to pre-compile a single FST corresponding to the relation between the very first 
input language and the very last output language. In this way, the single pre- 
composed transducer directly maps an applied string to the final result without 
producing any intermediate results. 

In some cases, however, compiling a single transducer will require a huge 
amount of memory. In the case of composition, in the worst case, the number 
of states in the composed network is the product of the number of states in the 
original networks. 

Another reason why precompilation may not be possible is that the desired 
behavior cannot be predicted in advance. As an example of unpredictability, 
consider the following case. A transducer T for morphological analysis is 
composed of four transducers: 



$$T = T_1 \circ T_2 \circ T_3 \circ T_4 \qquad (5)$$ 

Morphological analysis means mapping a surface string to a string that can 
be read as a morphological analysis of the input string, e.g., the French word 
“pensons” to “penser+Verb+PInd+1P+Pl”, indicating the first-person plural 
present indicative form of the verb “penser”. Let $T_1$ be a filter that blocks 
certain words, and $T_4$ be a transducer that provides a mapping between 
Unicode strings and another character encoding or transliteration for the 
language. And let us assume that there is a set of alternative filters, say 
$T_{11}$, $T_{12}$, $T_{13}$, and $T_{14}$, and a set of alternative encoding 
transducers, say $T_{41}$, $T_{42}$, $T_{43}$, and $T_{44}$, that are 
appropriate in various contexts and which are selected by the user at runtime. 
The precompilation of T to serve all possible combinations of filters and 
encodings would require constructing 16 different variants in this simple 
example. In interactive systems, users could conceivably control dozens of 
options affecting the final behavior of a transducer, which could raise the 
number of variants to several thousand. 
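
The count of 16 variants is simply the size of the Cartesian product of the 
two sets of alternatives. The short sketch below makes the arithmetic 
explicit; the strings are placeholders for the transducers named above, and 
the figure of twelve binary options is a hypothetical illustration of how 
quickly the number of precompiled variants grows. 

```python
from itertools import product

filters = ["T11", "T12", "T13", "T14"]    # alternatives for T1
encoders = ["T41", "T42", "T43", "T44"]   # alternatives for T4

# One precompiled cascade per (filter, encoder) choice; T2 and T3 are fixed.
variants = [(f, "T2", "T3", e) for f, e in product(filters, encoders)]
n_variants = len(variants)

# Hypothetical: twelve independent on/off options would already
# require 2**12 precompiled variants.
n_binary_variants = 2 ** 12
```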

3 Virtuality 

To avoid precompilation in such cases, the final transducer can be built virtually. 



3.1 Principles 

The aim is to work with an object that represents the result of the combinations 
of the original networks (virtual operation can also be a unary operation like 
virtual determinization), and to build the true result network incrementally, 
in a “lazy” fashion, as required in real use. Typically, when we apply a virtual 
operation on a real network or networks, we obtain a virtual network that has the 
same properties as a real one and can participate in further virtual operations. 
Although it is not directly exploitable, the virtuality mechanism makes the lazy 
evaluation transparent for the end user. 
Thus, a “virtual transducer” can be represented by a tree whose leaves are 
“physical transducers” and whose non-terminal nodes are operators such as 
union, intersection, concatenation, composition, or negation. Each time we need 
to use parts of the virtual transducer, this tree is partially evaluated as required 
to build the physical states and arcs to handle the actual input. 

In this way, the resulting transducer is available immediately, instead of wait- 
ing for a long computational time, and memory is used only for the parts of the 
final transducer that are actually used. The price to pay is somewhat slower 
operations since they need to construct physical states and arcs at runtime to 
work. 

Different Types of Virtuality. Basically, the principle of lazy operations 
is to evaluate variables or functions only when they are needed, without 
necessarily keeping the result. In the present case, it is advantageous to 
retain in memory the compiled physical parts of the network since they can be 
re-used often. Networks then act as if cached: for example, an operation will 
take more time to execute the first time, and will execute more quickly when 
repeated. 

We can now distinguish two types of virtual networks: virtual networks on 
states, and virtual networks on arcs. 

In virtual networks on states, the states are virtual. To follow a path in a 
virtual transducer, we need to realize all the states on this path (make them 
physical). This requires the arc set of each state to be physically constructed. 

In the second type of virtual network, the arcs are virtual. That is, we just 
need to realize arcs that are on the requested path, not the whole arc set of all 
the states that are on the path. What we are working on is an implementation of 
the first type of virtual network, based on the original implementation designed 
by Ronald M. Kaplan, John Maxwell and Lauri Karttunen. In this version, we 
need two special functions in order to realize states: an “arc set function” that 
builds the outgoing arcs of the state being realized, and a “finality function” 
that decides whether a state being realized must be final or not. 

Concretely, when virtual operations are applied, a new network is built that 
contains just a virtual initial state and is associated with a data structure 
describing the type of virtual operation and giving the two appropriate “arc 
set” and “finality” functions. This virtual initial state is associated with 
each initial state of the original physical or virtual networks. 

Fig. 6. Two types of virtuality. 



Arc Set Functions. For a state to be realized, we need to build its whole set 
of outgoing arcs from the sets of outgoing arcs of the associated original 
states, according to rules that depend on the virtual operation being 
performed. Physical arcs are created that point to new virtual states. These 
virtual states are associated with physical (resp. virtual) state(s) in the 
original physical (resp. virtual) network(s). The rules for creating the new 
arc set must be locally decidable since the function can only access 
information about the current virtual state and its associated states. 

For example, in a virtual network given by the virtual union of two networks, 
the arc set function will create an arc for all the arcs in the arc set of the two 
associated states with the same labels. This arc will point to a virtual state 
that will be associated with all the destination states of the matching arcs. This 
is fully decidable because all of the words in both of the two networks will be 
present in the resulting network. 

However, in the case of the virtual intersection, we will need to do the same 
work and build arcs that will later be revealed to be useless. This is due to words 
in each network that can be prefixes of words in the other network. In such a 
case, the decidability of the virtual operation will rely on the finality function. 



Finality Functions. When a virtual state is realized, the finality function will 
decide if the physical version of this state has to be final according to rules de- 
pending on the virtual operation performed. In the general case, it corresponds 
to the logical definition of the operation. To continue with the examples previ- 
ously taken, the finality function for a “virtual union” will perform a logical OR 
between the finality value of all the associated states, whereas it will perform a 
logical AND for a “virtual intersection” . 
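
The interplay of the two functions can be sketched for simple acceptors. The 
code below is an illustrative assumption, not the xfst data structures: the 
classes `Net` and `VirtualNet`, the dictionary layout of the arcs, and the use 
of a pair of state sets as a virtual state are all invented for this example, 
and it handles automata rather than transducers. Each virtual state is 
realized on demand by an arc set function and a finality function, and the 
realized states are cached. 

```python
class Net:
    """A physical acceptor: arcs[state][symbol] is a set of target states."""

    def __init__(self, initial, finals, arcs):
        self.initial, self.finals, self.arcs = initial, set(finals), arcs

    def step(self, states, symbol):
        return frozenset(t for s in states
                         for t in self.arcs.get(s, {}).get(symbol, ()))

    def labels(self, states):
        return {a for s in states for a in self.arcs.get(s, {})}


class VirtualNet:
    """Lazy binary operation whose states are realized on demand."""

    def __init__(self, n1, n2, finality):
        self.n1, self.n2, self.finality = n1, n2, finality
        self.initial = (frozenset([n1.initial]), frozenset([n2.initial]))
        self.realized = {}  # virtual state -> (arc set, is_final)

    def realize(self, state):
        if state not in self.realized:
            s1, s2 = state
            # Arc set function: one arc per label seen on either side,
            # pointing to the virtual state of all destination states.
            arcs = {a: (self.n1.step(s1, a), self.n2.step(s2, a))
                    for a in self.n1.labels(s1) | self.n2.labels(s2)}
            # Finality function: combine the finality of associated states.
            f1 = bool(s1 & self.n1.finals)
            f2 = bool(s2 & self.n2.finals)
            self.realized[state] = (arcs, self.finality(f1, f2))
        return self.realized[state]

    def accepts(self, word):
        state = self.initial
        for symbol in word:
            arcs, _ = self.realize(state)
            if symbol not in arcs:
                return False
            state = arcs[symbol]
        return self.realize(state)[1]


# Toy networks: A accepts "ab"; B accepts "a", "ab" and "ac".
A = Net(0, {2}, {0: {"a": {1}}, 1: {"b": {2}}})
B = Net(0, {1, 2}, {0: {"a": {1}}, 1: {"b": {2}, "c": {2}}})
union = VirtualNet(A, B, lambda x, y: x or y)        # logical OR
intersection = VirtualNet(A, B, lambda x, y: x and y)  # logical AND
```

In the intersection, the arc labeled "c" is still built when the state reached 
after "a" is realized; only the finality function later reveals that the path 
through it is useless, which mirrors the remark above about prefixes. 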

3.2 The Priority Union 

Let us see in more detail the behavior of our virtual networks by taking the 
virtual priority union [1] as an example. We can give two similar definitions 
of the priority union depending on the sides of the transducers we apply it 
on. On the upper side, the priority union will give the whole set of string 
pairs of the first transducer, and the string pairs from the second transducer 
whose upper string is not recognized by the first transducer. Mathematically: 

$$T_1 \cup_p T_2 = T_1 \cup (\neg \mathit{upper}(T_1) \circ T_2), \qquad (6)$$ 

where $\mathit{upper}(T_1)$ denotes the upper side of $T_1$. 
The other definition is: 

$$T_1 \cup_p T_2 = T_1 \cup (T_2 \circ \neg \mathit{lower}(T_1)), \qquad (7)$$ 

where $\mathit{lower}(T_1)$ denotes the lower side of $T_1$. 

The type of priority union to be used depends on the side of the resulting 
transducer we plan to apply strings on. The priority union is very useful in 
NLP because it permits lookup on transducers or dictionaries in a sequential 
way. More precisely, it can replace batch programs that try lookup on a 
sequence of dictionaries and stop at the first lookup that returns a non-empty 
result. 
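
At the level of relations, the upper-side priority union can be illustrated 
with plain sets of string pairs. This is only a sketch of what the definition 
means, under the assumption that a relation is stored as a finite set of 
`(upper, lower)` pairs; it says nothing about how the virtual operation is 
implemented, and the two toy lexicons are hypothetical. 

```python
def priority_union_upper(r1, r2):
    """Keep all pairs of r1, plus the pairs of r2 whose upper string
    is not the upper string of any pair in r1."""
    upper1 = {u for u, _ in r1}
    return set(r1) | {(u, l) for u, l in r2 if u not in upper1}


# Hypothetical lexicons: a main lexicon and a fallback guesser.
main_lexicon = {("walk+Verb", "walk"), ("mouse+Noun+Pl", "mice")}
guesser = {("mouse+Noun+Pl", "mouses"), ("blog+Noun+Pl", "blogs")}

result = priority_union_upper(main_lexicon, guesser)
```

The pair for "mouse+Noun+Pl" from the guesser is discarded because the main 
lexicon already covers that upper string, which is the same behavior as trying 
the dictionaries in sequence and stopping at the first successful lookup. 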

Our current implementation works in a binary way, that is each virtual state 
is associated with a pair of states from the original networks. If these states are 
non-deterministic, which means states have multiple arcs with the same symbol 
on the input side, several virtual states and state pairs are created. As each 
virtual state corresponds to a pair of states in the original networks, both of the 
“finality” and “arc set” functions deal with the following cases: 

— A : (state from Netl, state from Net2); 

— B : (state from Netl, No state); 

— C : (No state, state from Net2). 

Moreover, the initial state receives a particular treatment, for two reasons. 
First, since the priority union of two transducers contains the whole relation 
described by the first one, all arcs from its initial state are copied without 
any selection and point to new virtual states with state pairs of type A. 
Second, two networks could occur whose initial states both have loop arcs with 
the same label. These loop arcs must not be reproduced on the initial state of 
the resulting network, so that they are not concatenated with the network they 
do not belong to. To avoid this, we ensure that the initial state of the 
resulting network is not associated with the pair made of the initial states 
of the original networks. 

Following these considerations let us explain the work of the two “virtual” 
functions. The treatment made by the arc set function is quite simple. In cases B 
and C, it just has to copy the arc set of the associated state and make it point to 
virtual states of the same type (respectively B or C) . In case A, two arc sets have 
to be processed: arcs that match on the upper (lower) side between the two arc 
sets produce arcs labeled with the labels of the matched arcs from the second 
network, and point to a new virtual state corresponding to the target of the 
matching arcs in the two networks (type A) . Other arcs (that do not match with 
other arcs in the other network) are replicated and point to a virtual state of type 
B or C, depending on whether they belong to the first or to the second network. 
As for the virtual intersection, the arc set function cannot locally determine 
whether a pair of matching arcs is useful or not, and lets the finality 
function decide. The finality function has to determine the finality of a 
given virtual state according to the type of this state. Types B and C are 
easy to process. They correspond respectively to paths that are only in the 
first network, or only in the second one. In these cases the function just has 
to return the finality of the single associated state. Case A, however, 
corresponds to paths that have been followed in both networks. Then the 
function needs to apply the following logical formula to the finality of the 
associated states: 

$$\neg \mathrm{Finality}(\mathrm{StateA}) \wedge \mathrm{Finality}(\mathrm{StateB}) \qquad (8)$$ 

This allows adding successful paths in the resulting network that come from the 
second network and cannot be found in the first. 

Fig. 7. The priority union (on the upper side) of network A and network B. 

4 Conclusion and Future Work 

Virtual operations are useful because some operations on transducers cannot be 
realized without them. This is the case when we manipulate large transducers. 
Sometimes, we do not need the whole result of an operation. For example, to 
perform some lookups on a normalized German lexicon, we need to compose the 
lexicon network with a normalization filter network. In our test (on a Sparc 
Ultra 10 with a 150 MHz CPU and 128 MB of memory), this composition took 27 
seconds, and lookups were immediate. With the virtual composition, the result 
is immediate and lookups remain instantaneous. 

Some modifications must be made in the virtual priority union implementation 
to treat one-sided epsilons. Some optimizations can be added in order to 
manage more efficiently the memory used by virtual transducers. We also plan 
to consider the weighted case and to extend virtuality to weighted networks. 
Abstract. The state complexities of basic operations on nondeterministic 
finite automata (NFA) are investigated. In particular, we consider Boolean 
operations, catenation operations - concatenation, iteration, $\lambda$-free 
iteration - and the reversal on NFAs that accept finite and infinite 
languages over arbitrary alphabets. Most of the shown bounds are tight in the 
exact number of states, i.e. the number is sufficient and necessary in the 
worst case. For the complementation tight bounds in the order of magnitude are 
proved. It turns out that the state complexities of operations on NFAs and 
deterministic finite automata (DFA) are quite different. For example, the 
reversal and concatenation have exponential state complexity on DFAs but 
linear complexity on NFAs. Conversely, the complementation can be done with 
linear complexity on DFAs but needs exponentially many states on NFAs. 



1 Introduction 

Motivated by several applications and implementations of finite automata in 
software engineering, programming languages and other practical areas in 
computer science, the state complexity of deterministic finite automata has 
been studied in recent years. For example, the state complexity of the 
intersection of DFAs has been studied, a tight bound of $2^n$ states for the 
reversal has been shown, and catenations and other operations have been 
investigated as well. Results have also been obtained for the important case of 
finite languages, and a state-of-the-art survey is available. Related to the 
problem of finding upper bounds for the state complexity is the problem of 
efficiently simulating nondeterministic automata by deterministic ones. For 
example, transforming a certain type of NFA to a DFA gives an upper bound for 
the corresponding NFA state complexity of complementation. Results concerning 
the simulation problems have been shown as well. 

As has been pointed out in the literature, there are several good reasons why 
the size of DFAs is a natural and objective measure for regular languages. 
E.g. the size of a DFA is linear in the number of states of a DFA, but this is 
not necessarily true for NFAs. On the other hand, the influence of the degree 
of nondeterminism on the power and limitations of certain devices is an 
important question in descriptional complexity theory. Finite automata with 
limited nondeterminism have been considered in [6] where an infinite 
nondeterministic hierarchy of regular languages has been proved. The issue of 
quantifying inherent nondeterminism in regular languages is dealt with in [3]. 
The relation between ambiguity and the amount of nondeterminism has been 
considered as well. 

We expect that examining the state complexity of basic operations on NFAs 
will enhance the understanding of the relations between nondeterminism, ambi- 
guity and the power of finite automata. 



2 Preliminaries 

We denote the powerset of a set $S$ by $2^S$. The empty word is denoted by 
$\lambda$, the reversal of a word $w$ by $w^R$, and for the length of $w$ we 
write $|w|$. For the number of occurrences of a symbol $a$ in $w$ we use the 
notation $\#_a(w)$. 

A nondeterministic finite automaton (NFA) is a system 
$(S, A, \delta, s_0, F)$, where (1) $S$ is the finite set of internal states, 
(2) $A$ is the finite set of input symbols, (3) $s_0 \in S$ is the initial 
state, (4) $F \subseteq S$ is the set of accepting (or final) states, and 
(5) $\delta : S \times A \to 2^S$ is the transition function. 

The set of rejecting states is implicitly given by the partitioning, i.e. 
$S \setminus F$. 

If not otherwise stated throughout the paper we assume that the NFAs are 
always reduced. This means that there are no unreachable states and that from 
any state a final state can be reached. An NFA is said to be minimal if its 
number of states is minimal with respect to the accepted language. Since every 
n-state NFA with $\lambda$-transitions can be transformed to an equivalent 
n-state NFA without $\lambda$-transitions [3], for state complexity issues 
there is no difference between the absence and presence of 
$\lambda$-transitions. For convenience, we consider NFAs without 
$\lambda$-transitions only. 

As usual the transition function $\delta$ is extended to a function 
$\Delta : S \times A^* \to 2^S$ reflecting sequences of inputs as follows: 
$\Delta(s, wa) = \bigcup_{s' \in \Delta(s, w)} \delta(s', a)$, where 
$\Delta(s, \lambda) = \{s\}$ for $s \in S$, $a \in A$, and $w \in A^*$. In the 
sequel we always denote the extension of a given $\delta$ by $\Delta$. 

Let $\mathcal{A} = (S, A, \delta, s_0, F)$ be an NFA, then a word $w \in A^*$ 
is accepted by $\mathcal{A}$ if $\Delta(s_0, w) \cap F \neq \emptyset$. The 
language accepted by $\mathcal{A}$ is 
$L(\mathcal{A}) = \{w \in A^* \mid w \text{ is accepted by } \mathcal{A}\}$. 
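
The extended transition function and the acceptance condition translate 
directly into a few lines of code. The sketch below is an illustration under 
assumed conventions that are not part of the paper: the transition function is 
a dictionary from `(state, symbol)` to a set of states, an NFA is the triple 
`(delta, s0, finals)`, and the names `extended_delta` and `accepts` are 
hypothetical. 

```python
def extended_delta(delta, s, w):
    """Set of states reachable from state s after reading the word w."""
    current = {s}  # the extension maps the empty word to {s}
    for a in w:
        # One step of the recursion: union of delta(q, a) over current states.
        current = {t for q in current for t in delta.get((q, a), set())}
    return current


def accepts(nfa, w):
    """A word is accepted if some reachable state is final."""
    delta, s0, finals = nfa
    return bool(extended_delta(delta, s0, w) & finals)


# Two-state NFA for the language {aa}*.
even_a = ({(0, "a"): {1}, (1, "a"): {0}}, 0, {0})
```

For instance, `accepts(even_a, "aa")` holds while `accepts(even_a, "a")` does 
not, in line with the p-state bound for $\{a^p\}^*$ stated in Lemma 1 below 
for p = 2. 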

The next two preliminary results involve NFAs directly. They are key tools 
in the following sections, and can be proved by a simple pumping argument. 

Lemma 1. Let $p \geq 1$ be an arbitrary integer. Any NFA that accepts the 
language $\{a^p\}^+$ resp. $\{a^p\}^*$ needs at least $p + 1$ resp. $p$ states. 

The $(p+1)$th state is necessary since the initial state has to be a 
non-accepting one. If we modify the language to $\{a^p\}^*$ then the initial 
state could be equal to the accepting state. 






M. Holzer and M. Kutrib 



3 Boolean Operations 

We start our investigations with Boolean operations on NFAs that accept 
languages over arbitrary alphabets. In the case when the finite automaton is 
deterministic it is well-known that in the worst case the Boolean operations 
union, intersection and complementation have a state complexity of 
$m \cdot n$, $m \cdot n$ and $m$, respectively. However, the state complexity 
of NFA operations is essentially different. At first we consider the union. 

Theorem 2. For any integers $m, n \geq 1$ let A be an m-state and B be an 
n-state NFA. Then $m + n + 1$ states are sufficient and necessary in the worst 
case for an NFA C to accept the language $L(A) \cup L(B)$. 

Proof. In order to construct an $(m + n + 1)$-state NFA for the language 
$L(A) \cup L(B)$ we simply use a new initial state and connect it to the 
states of A and B that are reached after the first state transition. 

Now we are going to show that $m + n + 1$ states are necessary in the worst 
case. Let A be an m-state NFA that accepts the language $\{a^m\}^*$ and B an 
n-state NFA that accepts $\{b^n\}^*$. 

Let C be an NFA for the language $L(A) \cup L(B)$. In order to reject the 
inputs $a^i$, $1 \leq i \leq m - 1$, but to accept the input $a^m$ the NFA C 
needs at least $m - 1$ non-accepting states $s_1, \ldots, s_{m-1}$ from each of 
which a final state is reachable. Similarly, C needs at least $n - 1$ states 
$s'_1, \ldots, s'_{n-1}$ for processing the inputs $b^i$, 
$1 \leq i \leq n - 1$. 

Denote by $P_a$ resp. $P_b$ the set of states that are reachable by inputs of 
the form $a^i$ resp. $b^i$ for $i \geq 1$. None of the final states may be 
reachable from the states in $P_a \cap P_b$. Otherwise words of the form 
$a^i b^j$ or $b^i a^j$ would be accepted. 

It follows that neither the $s_i$ nor the $s'_i$ may belong to the intersection 
$P_a \cap P_b$. But, trivially, they do belong to $P_a$ resp. to $P_b$. Now 
consider all words in $\{a^m\}^+$. There must exist a final state $s_m$ that 
accepts infinitely many of them. Thus, $s_m$ is reachable from $s_m$ itself. 
The same holds for a state $s'_n$ for the words in $\{b^n\}^+$. It follows 
$s_m \in P_a$ and $s'_n \in P_b$ but $s_m \notin P_a \cap P_b$ and 
$s'_n \notin P_a \cap P_b$. Finally, the initial state $s_0$ must be a final 
state since $\lambda \in L(A) \cup L(B)$, but $s_0 \neq s_m$ and 
$s_0 \neq s'_n$ since otherwise $\{a^m\}^i b^n$ or $\{b^n\}^i a^m$ would be 
accepted for some $i \in \mathbb{N}$. Altogether, $P_a \cup P_b$ must contain 
at least $m + n$ different states that are not equal to the initial state. □ 
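
The upper-bound construction in the proof can be sketched as follows. The 
representation is an assumption made for this illustration (an NFA as a triple 
`(delta, s0, finals)` with `delta` mapping `(state, symbol)` to a set of 
states), as are the names `union_nfa`, `states_of` and `accepts`. The sketch 
also makes explicit one detail the proof leaves implicit: the new initial 
state is final whenever one of the original initial states is. 

```python
def union_nfa(a, b):
    """New initial state wired to the successors of both old initial states."""
    (da, sa, fa), (db, sb, fb) = a, b
    new = "init"
    delta = {}
    for tag, d, s in (("A", da, sa), ("B", db, sb)):
        for (q, x), targets in d.items():
            tagged = {(tag, t) for t in targets}  # keep state sets disjoint
            delta[((tag, q), x)] = set(tagged)
            if q == s:  # copy the first transition onto the new state
                delta.setdefault((new, x), set()).update(tagged)
    finals = {("A", q) for q in fa} | {("B", q) for q in fb}
    if sa in fa or sb in fb:
        finals.add(new)
    return delta, new, finals


def states_of(nfa):
    delta, s0, finals = nfa
    found = {s0} | set(finals)
    for (q, _), targets in delta.items():
        found.add(q)
        found |= targets
    return found


def accepts(nfa, w):
    delta, s0, finals = nfa
    current = {s0}
    for x in w:
        current = {t for q in current for t in delta.get((q, x), set())}
    return bool(current & finals)


# Witness languages from the proof with m = 2 and n = 3.
A = ({(0, "a"): {1}, (1, "a"): {0}}, 0, {0})                   # {aa}*
B = ({(0, "b"): {1}, (1, "b"): {2}, (2, "b"): {0}}, 0, {0})    # {bbb}*
C = union_nfa(A, B)
```

On these witnesses the result has 2 + 3 + 1 states and rejects mixed words such 
as "aab", which a construction that simply merged the two initial states would 
accept. 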

When we are concerned with finite languages the state complexity of the 
union can be reduced by three states. For these upper bounds in the deterministic 
case see [2]. We may assume w.l.o.g. that minimal NFAs for finite languages not 
containing the empty word have only one final state. Since such NFAs do not 
contain any cycles they do contain at least one final (sink) state for which the 
transition function is not defined. Now a given minimal NFA with more than 
one final state is modified such that a sink state becomes the only final state. 
Therefore simply the transition function has to be extended. If the finite language 
contains the empty word, then in addition the initial state is a second final one. 
Corollary 3. For any integers $m, n \geq 2$ let A be an m-state NFA and B be an 
n-state NFA. If $L(A)$ and $L(B)$ are finite, then $m + n - 2$ states are 
sufficient and necessary in the worst case for an NFA C to accept the language 
$L(A) \cup L(B)$. 

Proof. We can adapt the proof of the previous theorem as follows. Since NFAs 
for finite languages do not contain any cycles, for the construction of the 
NFA C we do not need a new initial state (this saves one state). Moreover, we 
can merge both initial states (this saves the second one) and both final sink 
states (this saves the third one). Now the construction of C is 
straightforward. 

The finite languages $\{a^m\}$ and $\{b^n\}$ are witnesses for the necessity of 
the number of states for the union in the worst case. An NFA that accepts the 
language $\{a^m\}$ needs at least $m + 1$ states. Otherwise it would run 
through cycles. By the same argumentation as in the proof of Theorem 2 and 
merged initial and sink states we obtain at least $(m + 1) + (n + 1) - 2$ 
states for an NFA that accepts $\{a^m\} \cup \{b^n\}$. □ 

The complementation of NFA languages is an expensive task at any rate. 
It is well known that $2^n$ is the tight upper bound on the number of states 
necessary for a deterministic finite automaton to accept an (infinite) n-state 
NFA language. Since the complementation operation on deterministic finite 
automata neither increases nor decreases the number of states (simply exchange 
final and non-final states) we obtain an upper bound for the state complexity 
of the complementation on NFAs. 

Corollary 4. For any integer $n \geq 1$ the complement of an n-state NFA 
language is accepted by a $2^n$-state NFA. 

Unfortunately, this expensive upper bound is tight in the order of magnitude. 
Currently it is open whether the exact upper bound is necessary in the worst 
case. Basically, the idea is to construct an efficiently acceptable language 
such that nondeterminism cannot do anything for an efficient acceptance of its 
complement. 

Theorem 5. For any integer $n > 2$ there exists an n-state NFA A such that 
any NFA that accepts the complement of $L(A)$ needs at least $2^{n-2}$ states. 

Proof. For $k \geq 0$ let $L_k = \{a,b\}^* a \{a,b\}^k b \{a,b\}^*$. It is clear 
that $L_k$ is accepted by the $(k + 3)$-state NFA depicted in Figure 1. 

Intuitively, A has to guess the position of an input symbol a which is 
followed by k arbitrary input symbols and a symbol b. In order to accept the 
complement of $L_k$ a corresponding NFA $B = (S', \{a, b\}, \delta', s'_0, F')$ 
has to verify that the input has no substring $a\{a,b\}^k b$. Therefore, after 
reading a symbol a, B must be able to remember the next k input symbols. 
Altogether this needs $2^{k+1}$ states. 

Fig. 1. A $(k + 3)$-state NFA accepting $\{a,b\}^* a \{a,b\}^k b \{a,b\}^*$. 

Fig. 2. A minimal NFA accepting $L_2$ of Theorem 5. 

More formally, we consider the input words of length $k + 1$. Observe that for 
each of these words w the concatenation ww belongs to the complement of $L_k$. 
Let $S(w)$ be 
$\{s \in S' \mid s \in \Delta'(s'_0, w) \wedge \Delta'(s, w) \cap F' \neq \emptyset\}$, 
and $v, v'$ be two arbitrary different words from $\{a,b\}^{k+1}$. Assume 
$S(v) \cap S(v') \neq \emptyset$. It follows 
$\Delta'(s'_0, vv') \cap F' \neq \emptyset$ and 
$\Delta'(s'_0, v'v) \cap F' \neq \emptyset$ and, therefore, $vv'$ and $v'v$ are 
accepted by B. 

But this is a contradiction since there exists a position 
$1 \leq p \leq k + 1$ at which $v$ has a symbol a and $v'$ a symbol b or vice 
versa. Thus either $vv'$ or $v'v$ is of the form 
$x_1 \cdots x_{p-1}\, a\, x_{p+1} \cdots x_{k+1}\, y_1 \cdots y_{p-1}\, b\, y_{p+1} \cdots y_{k+1}$ 
and, therefore, belongs to $L_k$. From the contradiction follows 
$S(v) \cap S(v') = \emptyset$. Since there exist $2^{k+1}$ words in 
$\{a,b\}^{k+1}$ the state set $S'$ has to contain at least $2^{k+1}$ states. □ 
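
The two combinatorial facts used in the proof can be checked by brute force 
for small k. The sketch below is only a sanity check under the stated 
definition of $L_k$, not a replacement for the argument; the function names 
`in_Lk` and `check_fooling_set` are hypothetical, and membership in $L_k$ is 
tested with a regular expression. 

```python
import re
from itertools import product


def in_Lk(word, k):
    """Membership in {a,b}* a {a,b}^k b {a,b}*."""
    return re.fullmatch(r"[ab]*a[ab]{%d}b[ab]*" % k, word) is not None


def check_fooling_set(k):
    """Verify both facts for all words of length k + 1 over {a, b}."""
    words = ["".join(p) for p in product("ab", repeat=k + 1)]
    # Fact 1: every square ww lies in the complement of L_k.
    if any(in_Lk(w + w, k) for w in words):
        return False
    # Fact 2: for distinct v, v2 at least one of v+v2, v2+v lies in L_k.
    return all(in_Lk(v + v2, k) or in_Lk(v2 + v, k)
               for v in words for v2 in words if v != v2)


results = [check_fooling_set(k) for k in range(4)]
```

Since the check enumerates all $2^{k+1}$ words and all pairs of them, it is 
practical only for small k. 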

The situation for finite languages over an $\ell$-letter alphabet, 
$\ell \geq 2$, is quite different, since the upper bound of the transformation 
to a deterministic finite automaton is different. It has been shown that 
$O(\ell^{n/(\log_2 \ell + 1)})$ states are an upper bound for deterministic 
finite automata accepting a finite n-state NFA language. 

Corollary 6. For any integers $\ell, n > 1$ the complement of a finite n-state 
NFA language over an $\ell$-letter alphabet is accepted by an 
$O(\ell^{n/(\log_2 \ell + 1)})$-state NFA. 

Note that for $\ell = 2$ the upper bound is $O(2^{n/2})$. A slight 
modification of the proof of the previous theorem yields: 

Theorem 7. For any integers $\ell > 1$ and $n > 2$ there exists a finite 
n-state NFA language L over an $\ell$-letter alphabet such that any NFA that 
accepts the complement of L needs at least $\Omega(\ell^{n/(2 \log_2 \ell)})$ 
states. 

Proof. For $\ell > 1$ let $A = \{a_1, \ldots, a_\ell\}$ be an alphabet. Let 
$k \geq 0$ be an integer. A finite language $L_k$ is defined by 
$A^j a_1 A^k y$, where $0 \leq j \leq k$ and $y \in A \setminus \{a_1\}$. 
The NFA depicted in Figure 3 accepts $L_k$ with $2k + 3$ states. Trivially 
$L_k$ is also accepted by an NFA with $2k + 4$ states. 

Fig. 3. A $(2k + 3)$-state NFA accepting $L_k$ of Theorem 7. 

An NFA B for the complement works similarly to the corresponding NFA in the 
previous proof. It need not remember $k + 1$ input symbols exactly, but whether 
a symbol has been $a_1$ or not. Since previously we argued with words of 
finite lengths it follows immediately that B needs at least $2^{k+1}$ states. 
Additionally the length of the prefix has to be tracked. For this purpose the 
state set has to be doubled such that we have a lower bound of $2^{k+2}$ 
states. Transforming $2 = \ell^{1/\log_2 \ell}$ and using $n = 2k + 4$ we 
obtain the lower bound $\ell^{n/(2 \log_2 \ell)}$. □ 

Next we are going to prove a tight bound for the remaining Boolean opera- 
tion, the intersection. 

Theorem 8. For any integers $n, m > 1$ let A be an m-state and B be an n-state
NFA. Then $m \cdot n$ states are sufficient and necessary in the worst case for an NFA
to accept the language $L(A) \cap L(B)$.

Proof. Clearly, the NFA defined by the cross-product of A and B accepts the
language $L(A) \cap L(B)$ with $m \cdot n$ states.
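
The cross-product construction can be illustrated with a small sketch. The dictionary representation of an NFA and the helper names `intersect_nfa`, `accepts` and `counter_nfa` are assumptions made for this illustration only; they do not come from the paper.

```python
from itertools import product

def intersect_nfa(A, B):
    """Cross-product NFA for L(A) ∩ L(B) with |S_A| * |S_B| states.

    An NFA is a dict with keys 'states', 'alphabet', 'delta', 'start', 'finals';
    'delta' maps (state, symbol) to a set of successor states (assumed encoding).
    """
    alphabet = A["alphabet"] & B["alphabet"]
    delta = {}
    for (p, q), a in product(product(A["states"], B["states"]), alphabet):
        succ_a = A["delta"].get((p, a), set())
        succ_b = B["delta"].get((q, a), set())
        if succ_a and succ_b:
            # Both components must be able to move on the same symbol.
            delta[((p, q), a)] = set(product(succ_a, succ_b))
    return {"states": set(product(A["states"], B["states"])),
            "alphabet": alphabet, "delta": delta,
            "start": (A["start"], B["start"]),
            "finals": set(product(A["finals"], B["finals"]))}

def accepts(nfa, word):
    """Simulate the NFA by tracking the set of currently reachable states."""
    current = {nfa["start"]}
    for symbol in word:
        current = {t for s in current
                   for t in nfa["delta"].get((s, symbol), set())}
    return bool(current & nfa["finals"])

def counter_nfa(symbol, k):
    """k-state automaton: words over {a, b} with a multiple of k occurrences of `symbol`."""
    delta = {(i, c): {(i + 1) % k if c == symbol else i}
             for i in range(k) for c in "ab"}
    return {"states": set(range(k)), "alphabet": {"a", "b"},
            "delta": delta, "start": 0, "finals": {0}}
```

With `counter_nfa("a", m)` and `counter_nfa("b", n)` as inputs, the product has exactly m · n states, which matches the upper bound stated in the theorem.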

As witness languages for the fact that the bound is reached in the worst case
define $L_k = \{w \in \{a,b\}^* \mid \#_a(w) \equiv 0 \pmod{k}\}$ for all $k \in \mathbb{N}$. An NFA that
accepts $L_k$ with k states is easily constructed.

Identically, $L'_k$ is defined to be $\{w \in \{a,b\}^* \mid \#_b(w) \equiv 0 \pmod{k}\}$. It
remains to show that an NFA C that accepts $L_m \cap L'_n$ for $m, n > 1$, needs at
least $m \cdot n$ states.

Consider the input words $a^i b^j$ and $a^{i'} b^{j'}$ with $0 \le i, i' \le m - 1$ and
$0 \le j, j' \le n - 1$, and assume $C = (S, A, \delta, s_0, F)$ has fewer than $m \cdot n$ states. Since
there are $m \cdot n$ such words, for at least two of them the intersection

$\{s \in S \mid s \in \delta(s_0, a^i b^j) \wedge \delta(s, a^{m-i} b^{n-j}) \cap F \ne \emptyset\} \cap \delta(s_0, a^{i'} b^{j'})$

is not empty. This implies $a^{i'} b^{j'} a^{m-i} b^{n-j} \in L_m \cap L'_n$. Since either $i \ne i'$ or $j \ne j'$
it follows $i' + m - i \not\equiv 0 \pmod{m}$ or $j' + n - j \not\equiv 0 \pmod{n}$, a contradiction.

□ 



4 Catenation Operations 

Now we turn to the catenation operations. In particular, tight bounds for
concatenation, iteration and $\lambda$-free iteration will be shown. Roughly speaking, in
terms of state complexity these are efficient operations for NFAs. Again, this
is essentially different when deterministic finite automata come into play. For
example, in eg a bound of $(2m - 1) \cdot 2^{n-1}$ states has been shown for the
DFA-concatenation, and in m a bound of $2^{n-1} + 2^{n-2}$ states for the iteration.

Theorem 9. For any integers $m, n > 1$ let A be an m-state NFA and B be an
n-state NFA. Then $m + n$ states are sufficient and necessary in the worst case
for an NFA C to accept the language $L(A)L(B)$.

Proof. The upper bound is due to the observation that in C one has simply to 
connect the final states in A with the states in B that follow the initial state. 
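
A minimal sketch of this construction follows, assuming an NFA is stored as a dictionary whose `delta` maps (state, symbol) pairs to sets of successors. The names `concat_nfa`, `cycle_nfa` and `accepts`, and the tagging of states with "A" or "B", are illustrative choices rather than anything fixed by the paper.

```python
def cycle_nfa(symbol, k):
    """k-state NFA accepting {symbol^k}^* (a single cycle of length k)."""
    return {"states": set(range(k)), "alphabet": {symbol},
            "delta": {(i, symbol): {(i + 1) % k} for i in range(k)},
            "start": 0, "finals": {0}}

def concat_nfa(A, B):
    """NFA for L(A)L(B) on the disjoint union of both state sets (m + n states)."""
    delta = {}
    for side, nfa in (("A", A), ("B", B)):
        for (s, a), succ in nfa["delta"].items():
            delta[((side, s), a)] = {(side, t) for t in succ}
    # Every final state of A additionally offers the moves of B's initial state.
    for (s, a), succ in B["delta"].items():
        if s == B["start"]:
            for f in A["finals"]:
                delta.setdefault((("A", f), a), set()).update(
                    ("B", t) for t in succ)
    finals = {("B", f) for f in B["finals"]}
    if B["start"] in B["finals"]:
        # The empty word is in L(B), so words of L(A) alone are accepted too.
        finals |= {("A", f) for f in A["finals"]}
    states = {("A", s) for s in A["states"]} | {("B", s) for s in B["states"]}
    return {"states": states, "alphabet": A["alphabet"] | B["alphabet"],
            "delta": delta, "start": ("A", A["start"]), "finals": finals}

def accepts(nfa, word):
    """Simulate the NFA by tracking the set of currently reachable states."""
    current = {nfa["start"]}
    for symbol in word:
        current = {t for s in current
                   for t in nfa["delta"].get((s, symbol), set())}
    return bool(current & nfa["finals"])
```

For example, `concat_nfa(cycle_nfa("a", 2), cycle_nfa("b", 3))` has five states and accepts words such as `aabbb`.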

The upper bound is reached for the concatenation of the languages $L(A) =
\{a^m\}^*$ and $L(B) = \{b^n\}^*$. The remaining proof follows the idea of the proof of
Theorem 2. □

In case of finite languages one state can be saved. 

Lemma 10. For any integers $m, n > 1$ let A be an m-state NFA and B be an
n-state NFA. If $L(A)$ and $L(B)$ are finite, then $m + n - 1$ states are sufficient
and necessary in the worst case for an NFA C to accept the language $L(A)L(B)$.

Proof. Since for finite languages A and B must not contain any cycles, the initial
state of B is not reachable after the construction of the previous theorem. Thus,
it can be deleted, which yields an upper bound of $m + n - 1$ states.

As witnesses for the tightness consider the languages and 

They are accepted by m-state resp. n-state NFAs. Clearly, any NFA for the
concatenation needs at least $m + n - 1$ states. □

The constructions yielding the upper bounds for the iteration and $\lambda$-free
iteration are similar. The trivial difference between both operations concerns
the empty word only.

Theorem 11. For any integer $n > 2$ let A be an n-state NFA. Then $n + 1$ resp.
$n$ states are sufficient and necessary in the worst case for an NFA to accept the
language $L(A)^*$ resp. $L(A)^+$.

Proof. Let $A = (S_A, A_A, \delta_A, s_{0,A}, F_A)$ be an n-state NFA. Then the transition
function of an n-state NFA $C = (S, A, \delta, s_0, F)$ that accepts the language $L(A)^+$
is for $s \in S$ and $a \in A$ defined as follows: $\delta(s, a) = \delta_A(s, a)$ if $s \notin F_A$, and
$\delta(s, a) = \delta_A(s, a) \cup \delta_A(s_{0,A}, a)$ if $s \in F_A$.

The other components remain unchanged, i.e., $S = S_A$, $s_0 = s_{0,A}$, and $F = F_A$.

If the empty word belongs to $L(A)$ then the construction works fine for $L(A)^*$
also. Otherwise an additional state has to be added: Let $s'_0 \notin S_A$ and define
$S = S_A \cup \{s'_0\}$, $s_0 = s'_0$, $F = F_A \cup \{s'_0\}$, and for $s \in S$ and $a \in A$:
$\delta(s, a) = \delta_A(s, a)$ if $s \notin F_A \cup \{s'_0\}$,
$\delta(s, a) = \delta_A(s, a) \cup \delta_A(s_{0,A}, a)$ if $s \in F_A$, and
$\delta(s, a) = \delta_A(s_{0,A}, a)$ if $s = s'_0$.
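
The two constructions can be sketched as follows. The dictionary encoding of NFAs, the function names `plus_nfa` and `star_nfa`, the default name `"s0'"` for the added state, and the sample automaton `L3` are assumptions of this illustration.

```python
def plus_nfa(A):
    """n-state NFA for L(A)^+: final states also offer the moves of the initial state."""
    delta = {key: set(succ) for key, succ in A["delta"].items()}
    for (s, a), succ in A["delta"].items():
        if s == A["start"]:
            for f in A["finals"]:
                delta.setdefault((f, a), set()).update(succ)
    return {**A, "delta": delta}

def star_nfa(A, new_start="s0'"):
    """NFA for L(A)^*; a fresh accepting initial state is added unless the
    empty word already belongs to L(A)."""
    C = plus_nfa(A)
    if A["start"] in A["finals"]:
        return C
    delta = dict(C["delta"])
    for (s, a), succ in A["delta"].items():
        if s == A["start"]:
            delta[(new_start, a)] = set(succ)   # imitate the old initial state
    return {"states": A["states"] | {new_start}, "alphabet": A["alphabet"],
            "delta": delta, "start": new_start,
            "finals": A["finals"] | {new_start}}

def accepts(nfa, word):
    """Simulate the NFA by tracking the set of currently reachable states."""
    current = {nfa["start"]}
    for symbol in word:
        current = {t for s in current
                   for t in nfa["delta"].get((s, symbol), set())}
    return bool(current & nfa["finals"])

# Sample input: words over {a, b} whose number of a's is 2 (mod 3).
L3 = {"states": {0, 1, 2}, "alphabet": {"a", "b"},
      "delta": {**{(i, "a"): {(i + 1) % 3} for i in range(3)},
                **{(i, "b"): {i} for i in range(3)}},
      "start": 0, "finals": {2}}
```

Here `star_nfa(L3)` has four states, while `plus_nfa(L3)` keeps the original three.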

In order to prove the tightness of the bounds for any $n > 2$ let

$L_n = \{w \in \{a, b\}^* \mid \#_a(w) \equiv n - 1 \pmod{n}\}$

The language is accepted by an n-state NFA. At first we show that $n + 1$
states are necessary for $C = (S, \{a, b\}, \delta, s_0, F)$ to accept $L(A)^*$.

Contrarily, assume C has at most n states. We consider words of the form $a^i$
with $0 \le i$. The shortest four words belonging to $L(A)^*$ are $\lambda$, $a^{n-1}$, $a^{2n-2}$,
and $a^{2n-1}$. It follows $s_0 \in F$. Moreover, for $a^{n-1}$ there must exist a path
$s_0 \vdash s_1 \vdash \cdots \vdash s_{n-2} \vdash s_n$ where $s_n \in F$ and $s_1, \ldots, s_{n-2}$ are different non-accepting
states. Thus, C has at least $n - 2$ non-accepting states.

Assume for a moment F to be a singleton. Then $s_0 = s_n$ and for $1 \le i \le n - 3$
the state $s_0$ must not belong to $\delta(s_i, a)$. Processing the input $a^{2n-1}$ the NFA
cannot enter $s_0$ after $2n - 2$ time steps. Since $a \notin L(A)^*$ the state $s_0$ must not
belong to $\delta(s_0, a)$.

On the other hand, C cannot enter one of the states $s_1, \ldots, s_{n-3}$ since there is
no transition to $s_0$. We conclude that C is either in state $s_{n-2}$ or in an additional
non-accepting state $s_{n-1}$. Since there is no transition such that $s_{n-2} \in \delta(s_{n-2}, a)$,
in both cases there exists a path of length n from $s_0$ to $s_0$. But $a^n$ does not belong
to $L(A)^*$ and we have a contradiction to the assumption $|F| = 1$.

Due to our assumption $|S| \le n$ we now have $|F| = 2$ and $|S| - |F| = n - 2$.
Let us recall the accepting sequence of states for the input $a^{n-1}$:
$s_0 \vdash s_1 \vdash \cdots \vdash s_{n-2} \vdash s_n$. Both $s_0$ and $s_n$ must be accepting states. Assume $s_n \ne s_0$.
Since $a^{2n-2}$ belongs to $L(A)^*$ there must be a possible transition $s_0 \vdash s_1$ or
$s_n \vdash s_1$. Thus, $a^{2n-2}$ is accepted by $s_n$. In order to accept $a^{2n-1}$ there must be a
corresponding transition from $s_n$ to $s_n$ or from $s_n$ to $s_0$. In both cases the input
$a^n$ would be accepted. Therefore $s_n = s_0$.

By the same argumentation the necessity of a transition for the input symbol
a from $s_0$ to $s_0$ or from $s_0$ to $s_n$ follows. This implies that a is accepted. From
the contradiction follows $|S| > n$.

As an immediate consequence we obtain the tightness of the bound for
$L(A)^+$. In this case $s_0 \in F$ is not required. Thus, just one final state is
necessary. □

The state complexity for the iterations in the finite language case is n resp. 
n — 1. Without proof we state: 

Lemma 12. For any integer $n > 1$ let A be an n-state NFA. If $L(A)$ is finite,
then $n - 1$ resp. $n$ states are sufficient and necessary in the worst case for an
NFA to accept the language $L(A)^*$ resp. $L(A)^+$.

5 Reversal 

The last operation under consideration is the reversal. For deterministic
automata one may expect that the state complexity is linear. But it is not. In
[15] for infinite languages a tight bound of $2^n$ has been shown. A proof of a
tight bound for finite languages can be found in [T]. It is of order $O(2^{n/2})$ for a
two-letter alphabet. From the following efficient bounds for NFAs it follows once
more that nondeterminism is a powerful concept.

Theorem 13. For any integer $n > 3$ let A be an n-state NFA. Then $n + 1$
states are sufficient and necessary in the worst case for an NFA C to accept the
language $L(A)^R$.

Fig. 4. A (k + 3)-state and a (k + 4)-state NFA accepting $L_k$ and $L_k^R$ of Theorem 13.



Proof. Basically, the idea is to reverse the directions of the transitions. This 
works fine for NFAs whose set of final states is a singleton or whose initial state 
is not within a loop. In general we are concerned with more than one accepting 
state and have to add a new initial state. If, in addition, the old initial state is 
part of a loop, then its role cannot be played by the new one and we obtain an
(n + 1)-state NFA.
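
As a rough illustration of the general case, the sketch below always adds a fresh initial state, which gives the (n + 1)-state upper bound; it does not attempt the savings mentioned for a single final state or a loop-free initial state. The dictionary encoding and the names `reverse_nfa`, `accepts` and `"r0"` are assumptions of the example.

```python
def reverse_nfa(A, new_start="r0"):
    """NFA for the reversal of L(A): flip every transition, add one initial state."""
    delta = {}
    for (s, a), succ in A["delta"].items():
        for t in succ:
            delta.setdefault((t, a), set()).add(s)      # reversed edge t -> s
    # The fresh initial state imitates every old final state of A.
    for f in A["finals"]:
        for a in A["alphabet"]:
            if (f, a) in delta:
                delta.setdefault((new_start, a), set()).update(delta[(f, a)])
    finals = {A["start"]}
    if A["start"] in A["finals"]:
        finals.add(new_start)       # the empty word is its own reversal
    return {"states": A["states"] | {new_start}, "alphabet": A["alphabet"],
            "delta": delta, "start": new_start, "finals": finals}

def accepts(nfa, word):
    """Simulate the NFA by tracking the set of currently reachable states."""
    current = {nfa["start"]}
    for symbol in word:
        current = {t for s in current
                   for t in nfa["delta"].get((s, symbol), set())}
    return bool(current & nfa["finals"])
```

The returned automaton accepts a word exactly when the input automaton accepts its mirror image.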

The language $L_k = a^k\{a^{k+1}\}^*(\{b\}^* \cup \{c\}^*)$ for $k > 1$, may serve as an
example for the fact that the bound is reached. The (k + 3)-state NFA A that
accepts $L_k$ and the (k + 4)-state NFA C that accepts $L_k^R$ are depicted in Figure 4.

The necessity of fc + 4 states can be seen as follows. Since accepted inputs 
may begin with an arbitrary number of b’s or c’s we need two states Sb and Sc to 
process them. This cannot be done by the initial state because the loops would 
lead to acceptance of words with prefixes of the form b*c* or c*b*. 

Obviously, a loop of $k + 1$ states is needed in order to verify the suffix
$\{a^{k+1}\}^* a^k$. If one in this sequence would be equal to $s_b$ ($s_c$), then it would have
a loop for b's (c's) and, hence, inputs of the form $c^*a^*b^*a^k$ ($b^*a^*c^*a^k$) would
be accepted. For similar reasons the new initial state cannot be within a loop.
Altogether it follows that C needs at least $k + 4$ states, which proves the tightness
of the bound. □

The fact that NFAs for finite languages do not have any cycle leads once 
more to the possibility of saving one state compared with the infinite case. 

Lemma 14. For any integer $n > 1$ let A be an n-state NFA. If $L(A)$ is finite,
then n states are sufficient and necessary in the worst case for an NFA to accept
the language $L(A)^R$.

Proof. Recall from the proof of Corollary El that for every minimal n-state NFA 
that accepts a non-empty finite language there exists an equivalent n-state NFA 
that has only one final state. By the construction of the previous proof we obtain 
an (n + 1)-state NFA that has an unreachable state. It is the unique former final
state. The bound follows if the state is deleted.

Let for $n > 1$ the language $L_n$ be defined to be $\{a, b\}^{n-1}$. Trivially, $L_n$ is
accepted by an n-state NFA. Since $L_n = L_n^R$ the assertion follows. □

The bound for the reversal of finite NFA languages is in some sense strong. 
It is sufficient and reached for all finite languages. Finally, Table 1 summarizes
the shown state complexity bounds for NFAs. 

Table 1. Comparison of the NFA and DFA state complexities (I is the number of
states, t is the number of final states of the 'left' automaton).

             NFA                        DFA
        finite       infinite      finite    infinite
∪       m + n − 2    m + n + 1     O(mn)     mn
∩       O(mn)        mn            O(mn)     mn
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n -k 1 
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2^ 
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m -\- n 

Abstract. This paper imposes further discipline on the use of adaptive
automata [Jos94], [Iwa00] by restricting some of their features, in order
to obtain devices that are easier to create and more readable, without
losing computational power. An improved notation is proposed as a
first attempt towards a language for adaptive paradigm programming.

Keywords: adaptive devices, rule-driven formalisms, self-modifying ma- 
chines, adaptive automata, adaptive paradigm. 



1 Introduction 

In [Jos01], the structure and the operation of adaptive rule-based devices have
been formally stated. Structured pushdown automata [Jos93] are a variant of 
classical pushdown automata, in which states are clustered into mutually recur- 
sive finite-state sub-machines, which restrict the usage of the control pushdown 
store to the handling of return states only. Adaptive automata are self-modifying 
rule-driven formalisms whose underlying non-adaptive devices are the structured 
pushdown automata. Structured pushdown automata are fully equivalent to clas- 
sical pushdown automata. 

However, despite these features, adaptive automata sometimes lack simplicity,
which makes them difficult to understand and maintain.

The proposal described in this paper imposes some restrictions on the use of
the features of the model in order to obtain devices that are easier to create and
understand, without losing any of their original computational power.



2 Adaptive Automata 

For adaptive automata to perform self-modification, adaptive actions attached
to their state-transition rules are activated whenever the transition is
applied.

The Underlying Structured Pushdown Automata. A finite-state automaton
is composed of a set of states, a finite non-empty alphabet, a transition
function, an initial state and a set of final states. Transitions map ordered pairs
specifying the current state and the current input symbol into a new state.
There are two types of transitions from state A to state B:

(a) Transitions $(A, a) \to B$, which consume an input symbol a; and

(b) Empty transitions $(A, \varepsilon) \to B$, which do not modify the input.

A structured pushdown automaton also exhibits a set of states, a finite non- 
empty alphabet, an initial state, a set of final states, a pushdown alphabet and a 
transition function, including internal transitions, like those shown for finite-state 
automata, and external transitions, responsible for the calling and returning 
scheme. Besides the two types of internal transitions, sub-machines allow special
call and return transitions:

(a) Transitions $(A, \varepsilon) \to (\downarrow B, X)$ from state A, calling a sub-machine whose
initial state is X. B is the return state, to which control will be passed when
a return transition is performed by the called sub-machine. B is pushed onto the
pushdown store when these transitions are executed.

(b) Transitions $(C, \varepsilon) \to (\uparrow B, B)$ from some state C in the current sub-machine's
set of final states. State B, which represents any state that has been previously
pushed onto the pushdown store by the sub-machine that called the current one,
is popped out of the pushdown store and the caller sub-machine is then resumed
at the popped state.

The Adaptive Mechanism. Adaptive actions change the behavior of an adap- 
tive automaton by modifying the set of rules defining it. In adaptive automata, 
the adaptive mechanism consists of executing one adaptive action attached to 
the state transition rule chosen for application before the rule is performed, and 
a second one after applying the subjacent state transition rule. 

The adaptive mechanism of adaptive automata is described in [Jos01]: it is
defined by attaching a pair of (optional) adaptive actions to the subjacent non- 
adaptive rules defining their transitions, one for execution before the transition 
takes place and another for being performed after executing the transition. 

At each execution step of an adaptive automaton, the device's current state,
the contents of the top position in the pushdown storage and the current input
symbol determine a set of feasible transitions to be applied. In deterministic
cases, the set is either empty (no transition is allowed) or it contains a single
transition (in this case, that transition is immediately applied). In non-deterministic
cases, more than one transition is allowed to be executed in parallel. In sequential
implementations, a backtracking scheme chooses one of the allowed transitions
to apply.

Adaptive actions are formulated as calls to adaptive parametric functions.
These describe the modifications to apply to the adaptive automaton whenever
they are called. These changes are described and executed in three sequential
steps: (a) An optional adaptive action may be specified for execution prior to
applying the specific changes to the automaton. (b) A set of elementary adaptive
actions specifies the modifications performed by the adaptive action being
described. (c) Another optional adaptive action may be performed after the specific
modifications are applied to the automaton.

Elementary adaptive actions specify the actual modifications to be imposed
on the automaton. Changes are performed through three classes of adaptive
actions, which specify a transition pattern against which the transitions in use
are to be tested: (a) Inspection-type actions (introduced by a question mark in
usual notation), which search the current set of transitions in the automaton for
transitions whose shape matches the given pattern. (b) Elimination-type adaptive
actions (introduced by a minus sign in usual notation), which eliminate from the
current set of transitions in the automaton all transitions matching the given
shape. (c) Insertion-type adaptive actions (introduced by a plus sign in usual
notation), which add to the set of current transitions a new one, according to
the specified shape.

The adaptive mechanism turns a usual automaton into an adaptive one by
allowing its set of rules to change dynamically.
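
The three classes of elementary adaptive actions can be pictured as operations on a set of transitions. The sketch below is a simplification under stated assumptions: transitions are triples (source, symbol, target), a pattern is a triple in which `None` plays the role of an unbound variable, and the function names are hypothetical.

```python
def matches(transition, pattern):
    """A transition matches when every bound component of the pattern agrees."""
    return all(p is None or p == t for t, p in zip(transition, pattern))

def inspect(rules, pattern):
    """?[pattern]: return the transitions whose shape matches the pattern."""
    return {t for t in rules if matches(t, pattern)}

def eliminate(rules, pattern):
    """-[pattern]: return the rule set without the matching transitions."""
    return {t for t in rules if not matches(t, pattern)}

def insert(rules, transition):
    """+[pattern]: return the rule set extended with one new transition."""
    return rules | {transition}

# Usage: redirect every transition leaving state "A" on symbol "a" to state "N".
rules = {("A", "a", "B"), ("A", "b", "C"), ("B", "a", "C")}
found = inspect(rules, ("A", "a", None))
rules = insert(eliminate(rules, ("A", "a", None)), ("A", "a", "N"))
```

Each function returns a new set instead of mutating its argument, which keeps the effect of a single elementary action easy to follow.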

3 Improving the Formulation of Adaptive Automata 

In this section we discuss some of the main drawbacks of the traditional version of 
the formalisms used for representing adaptive automata in previous publications. 

The Notation. The notation used to represent adaptive automata is the first 
source of drawbacks to be considered in our study, for the simplicity of the 
model relies on the use of notations with the adequate features: a good notation 
is expected to be at least compact, simple, expressive, unambiguous, readable, 
and easy to learn, understand and maintain. 

The notations for adaptive automata and structured pushdown automata
generally differ in details, but there are two main classes of notations: graphical
ones, which are better for human visualization, and symbolic ones, which are more
compact and machine-readable.

We compared notations still in use, and chose an algebraic and a graphical 
one, according to their characteristics and functionality: 



Transition type           Symbolic notation               Graphical notation
Transition consuming a    $(A, a) \to B$                  [diagram]
Empty transition          $(A, \varepsilon) \to B$        [diagram]
Initial state             Explicitly indicated            [diagram]
Final state               Explicitly indicated            [diagram]



For structured pushdown automata, the final choice preserves the notation
established for finite-state automata to denote internal transitions in
sub-machines. The symbol $\varepsilon$ (the empty string) has been preserved in both
finite-state and structured pushdown notations in order to maintain compatibility
with traditional well-established notations. The following table adds notation
for expressing (empty-transition) sub-machine calls and returns:



Transition type                                  Symbolic notation                                 Graphical notation
Call sub-machine X from state A,                 $(A, \varepsilon) \to (\downarrow B, X)$          [diagram]
returning to state B
Return to state R after executing the            $(C, \varepsilon) \to (\uparrow R, R)$            [diagram]
called sub-machine X in its final state C        for all possible R

For adaptive automata, all transitions not calling adaptive actions are
denoted as stated above. Adaptive transitions refer to up to a pair of adaptive
actions, B (before-action) and A (after-action). Their notation is summarized
in the table below.

Adaptive actions may be attached to internal transitions only, which avoids
superposing the effects of two different sources of complexity in the same
transition rule.



Transition type                                  Symbolic notation                     Graphical notation
Adaptive transition with "before"                $(A, a) \to B\ [B\bullet]$            [diagram]
adaptive action attached
Adaptive transition with "after"                 $(A, a) \to B\ [\bullet A]$           [diagram]
adaptive action attached
Adaptive transition with both                    $(A, a) \to B\ [B \bullet A]$         [diagram]
adaptive actions attached


In the general case, adaptive actions B and A are representations of parametric
calls to adaptive functions, which have the general form $M(p_1, p_2, \ldots, p_n)$
where $p_1, p_2, \ldots, p_n$ are n arguments passed to an adaptive function named M.

Adaptive actions are symbolically declared apart from the adaptive automaton,
and they comprise a header and a body. In the header, the name and
the formal parameters of the adaptive function are defined, followed by a section
in which the names of all variables and generators are declared.

The body part is formed by an optional adaptive function call to be executed
on entry, followed by a set of elementary adaptive actions, responsible for the
modifications to be performed. A further optional call specifies another adaptive
function to be executed on exit. Both calls are denoted in the usual way, as
mentioned above.

There are three types of elementary adaptive actions: insertion, elimination 
and inspection actions. The notation chosen for the elements of the declaration 
of an adaptive function is shown in the table below. No graphic notation is yet 
suggested for adaptive functions. 



Element of the declaration    Symbolic notation
Adaptive function name        M
Parameters                    $(p_1, p_2, \ldots, p_n)$
Variables                     $v_1, v_2, \ldots, v_n$
Generators                    $g_1, g_2, \ldots, g_n$
Inspection actions            ? [ pattern ]
Elimination actions           - [ pattern ]
Insertion actions             + [ pattern ]
Pattern                       Any transition



Graphical notations have been tried [Alm95] and proved effective in
some particular cases, where the self-modifications to be performed are small and
easily visible. It is difficult to represent graphically the operation of adaptive
functions in their full generality. We chose to adopt symbolic descriptions for
adaptive functions, even when graphical notation is used to describe the adaptive
automata they refer to. In order to provide an acceptable notation for dealing
with sets and predicates, we chose to adopt the usual notation of predicate
calculus, including quantifiers, for expressing adaptive functions.

The Underlying Model. The underlying model for adaptive automata is the 
structured pushdown automaton. Sub-machines may be considered as improved 
finite-state automata that are allowed to recursively call each other. The best 
feature of this arrangement is that structured pushdown automata turn out to 
be easier to design and understand than pushdown automata. 

Pushdown machines may be considered excessively complex for some applications,
for which finite-state automata have been used successfully. In these cases,
some means should be provided to avoid the presence of unnecessary features in
the underlying non-adaptive automaton.

In the special cases for which a simple finite-state mechanism is enough, we
may suppress the pushdown storage from the notation, reducing the remaining
sub-machine to a simple finite-state machine.

The device resulting from the suggested simplification becomes an adaptive 
finite-state automaton, and may be formally stated just as published before in 
Section 4 of [Jos01].

The Adaptive Mechanism. The principle of this mechanism consists in modifying
the set of rules of the adaptive automaton by performing two adaptive actions,
one before and another one after executing the underlying state-transition
rule. However, one may ask whether a pair of adaptive actions is really needed.
An element that makes adaptive devices substantially harder to understand
is the structure of the adaptive functions themselves.

Adaptive functions are allowed to perform a pair of adaptive actions, one 
before and another after the modifications the adaptive function is expected to 
perform. 

The set of parameters allowed in adaptive functions is another feature that
is questionable when we search for simplicity: is it really necessary to allow an
arbitrary number of parameters? Would it be better to limit the number of
parameters to a minimum? What should this minimum be? Should adaptive
functions have no parameters at all? In the case of allowing parametric adaptive
functions, should parameter types be controlled instead of arbitrarily chosen?
Should adaptive functions be allowed as parameters?

Elementary adaptive actions are another source of complexity, since no
restrictions are imposed on their use. Some questions may be posed concerning
these elements of the formulation of adaptive automata: Should multiple
variables be allowed in inspecting and eliminating elementary adaptive actions?
Should looping be allowed within elementary adaptive actions?

In the following text we propose answers to several questions posed here, 
with the intent of achieving for adaptive automata a formulation according to 
our simplicity goals. 

Adaptive Actions. In adaptive automata, the adaptive mechanism consists
of executing the pair of adaptive actions attached to a rule at the time it is chosen
for application. The first adaptive action in the pair is executed before the rule is
performed, while the second one is executed after applying the subjacent
state-transition rule. We limit to one the number of adaptive actions attached to the
corresponding subjacent rule. Indeed, [Iwa00] gives a proof of the following
theorem, which shows that there is no need to attach a pair of adaptive actions per
rule; a single one is enough.

Theorem 1. The result of the execution of any adaptive action is equivalent to 
a non-empty sequence of elementary adaptive actions. 

Proof. This theorem is fully demonstrated in [Iwa00] by simulating the device
with the specified restrictions. □

Further simplifications may be achieved by using the following theorem: 

Theorem 2. For each rule defining an adaptive automaton, there is an equiva- 
lent set of rules, all of which have at most one attached (after-) adaptive action. 

Proof. This theorem may be proved by using the previous one and by showing 
that each adaptive transition having an attached before-adaptive action may 
be decomposed into a sequence of two simpler ones: the first one is an empty 
transition having as its attached after-adaptive action the before-adaptive action 
in the original transition; the second one is a copy of the original transition, from 
which the before-adaptive action has been removed. □ 
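
The decomposition used in this proof can be sketched as a rewriting of the rule set. The `Rule` record, the function name `remove_before_actions` and the naming scheme for the intermediate state are assumptions of this illustration; in particular, the proof does not name the state that links the two new transitions, so a fresh one is invented here.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass(frozen=True)
class Rule:
    source: str
    symbol: Optional[str]            # None stands for the empty transition
    target: str
    before: Optional[str] = None     # name of the before-adaptive action, if any
    after: Optional[str] = None      # name of the after-adaptive action, if any

def remove_before_actions(rules):
    """Rewrite the rules so that only after-adaptive actions remain."""
    fresh = count()
    result = []
    for r in rules:
        if r.before is None:
            result.append(r)
            continue
        mid = f"{r.source}_pre{next(fresh)}"   # hypothetical fresh state name
        # 1. Empty transition carrying the old before-action as its after-action.
        result.append(Rule(r.source, None, mid, after=r.before))
        # 2. Copy of the original transition without the before-action.
        result.append(Rule(mid, r.symbol, r.target, after=r.after))
    return result
```

Every rule with a before-action is thus replaced by two rules, so the rule set grows by at most a factor of two.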



Theorem 3. For each adaptive function that calls a pair of attached adaptive 
actions there is an equivalent set of simpler adaptive functions, all of which have 
at most one attached (after-) adaptive action. 

Proof. The proof of this theorem follows from the result of the previous one, and 
is based on showing that each adaptive function F that calls a before-adaptive 
action B may be decomposed into a sequence of two simpler ones in the following 
way: F becomes a copy of B's body, followed by a call to an auxiliary function $F_1$,
where $F_1$ is a copy of F from which the call to B has been removed. □



4 Improving the Formulation of the Underlying Model 

Structured pushdown automata use their pushdown store in two extremely lim- 
ited situations, in which no symbol is consumed from the input string: (a) when 
a sub-machine is called, the return state is pushed onto the pushdown store be- 
fore control is passed to the starting state of the sub-machine being called, and 
(b) when a sub-machine finishes its activity, a return state is popped from the 
pushdown store, and a return is made to that state in the calling sub-machine. 

In this section we propose some changes to the underlying structured push- 
down automata. These suggestions rely on the following theorem. 

Theorem 4. Adaptive (structured pushdown) automata are equivalent to adap- 
tive finite-state automata. 

Proof. Proving that an adaptive structured pushdown automaton may simulate
an adaptive finite-state automaton is straightforward. The opposite clause of the 
theorem is proved by simulating the pushdown store with states and transitions 
of adaptive finite-state automata. The resulting model allows performing the 
same work without using explicit memory. Such simulation may be sketched in 
three steps: 

– We can suppose, without loss of generality, that all sub-machines in the
adaptive automaton have single initial and final states.

– For sub-machines that are not self-embedded, substituting an equivalent
empty adaptive transition for every sub-machine call is enough. Indeed, if a
calling sub-machine N invokes sub-machine M, we substitute the following
adaptive transition in N:

[diagram]

for the original sub-machine call:

[diagram]

where $\varepsilon$ is the empty word and $\mathcal{A}(q)$ is an adaptive function which performs
a macro expansion of the call to sub-machine M by replicating its topology
in the exact place where the former call was, in such a way that the initial
state of sub-machine M is reached by a unique empty transition from state
q and the state p is reached by a unique empty transition from the unique
final state of M. Note that the adaptive function $\mathcal{A}$ is executed only once.
Such replacements finish in a finite number of steps since, by hypothesis,
there is no self-embedding in M.

– If any sub-machine in the adaptive automaton is self-embedded, then it
suffices to substitute the adaptive transition (bottom of the following table)
for every self-sub-machine call (top of the table). Rhombuses represent the
body of sub-machine M:

Where, again, $\varepsilon$ is the empty word, but now $\mathcal{A}$ is an adaptive function
with the effect shown in the last box.

Note that this adaptive function duplicates the sub-machine's topological
structure and the states $q_0$, $q_1$, q and p. It must also be clear that this process
may not terminate, but it is nevertheless a safe way to simulate adaptive
(structured pushdown) automata with adaptive finite-state automata. □



Improving the Formulation of the Adaptive Mechanism. The adaptive
mechanism is indeed an important source of complexity in the formulation of
adaptive automata, which may be simplified in several aspects, some of which
are the following: (a) by limiting to one the number of adaptive actions attached
to each rule and/or called inside adaptive functions; (b) by restricting the nature
and number of parameters allowed for adaptive functions; (c) by avoiding
multiple variables to be inspected at the same time in a single inspection; (d)
by avoiding loops within elementary adaptive actions; (e) by avoiding adaptive
functions passed as parameters. Unfortunately, if we impose too many
simplifications on the formulation, it becomes less expressive, requiring more clauses to
achieve the desired effect. However, by restricting the formulation, simpler facts
are expressed by each adaptive action, rendering the formulation easier to
understand. The hints above surely help in searching for a cleaner and more effective
formulation. This is a challenge yet to be overcome.



5 Illustrating Example 

In this section, we choose a non-deterministic adaptive finite automaton and
use it to solve the well-known string-matching problem of determining whether
a given string is a sub-string of some text. One classical solution for this problem
is as follows: (a) Create a non-deterministic finite-state automaton that solves
the problem. Constructing such an automaton is straightforward: at its initial state,
a loop consumes any symbol in its alphabet; next, a simple path consumes the
sequence of symbols in the string we are looking for; and, at the end of this
path, a unique final state consumes any further alphabet symbols. The explicit
non-determinism in this automaton is located at the beginning of the path that
accepts the required pattern. (b) Use a standard method to eliminate the
non-deterministic transitions in the automaton. (c) Use a standard algorithm to
minimize the resulting deterministic automaton.
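
Step (a) of this classical solution can be sketched as follows. The function names `pattern_nfa` and `accepts` and the numbering of states from 0 to the pattern length are assumptions of the example, not notation from the paper.

```python
def pattern_nfa(pattern, alphabet):
    """NFA with len(pattern) + 1 states accepting every text that contains `pattern`."""
    last = len(pattern)
    delta = {}
    for c in alphabet:
        delta.setdefault((0, c), set()).add(0)         # loop at the initial state
        delta.setdefault((last, c), set()).add(last)   # loop at the final state
    for i, c in enumerate(pattern):
        delta.setdefault((i, c), set()).add(i + 1)     # path spelling the pattern
    return {"delta": delta, "start": 0, "finals": {last}}

def accepts(nfa, text):
    """Simulate the NFA by tracking the set of currently reachable states."""
    current = {nfa["start"]}
    for symbol in text:
        current = {t for s in current
                   for t in nfa["delta"].get((s, symbol), set())}
    return bool(current & nfa["finals"])

nfa = pattern_nfa("aba", "ab")
```

For the pattern `aba`, the only non-deterministic choice is at state 0 on the symbol `a`, where the automaton may either stay in the loop or start the path.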

In [Hol00] an algorithm is presented that directly constructs the desired
deterministic finite-state automaton; it has the advantage of avoiding the
elimination of non-deterministic transitions, a process whose complexity
is exponential. Additionally, in this method the full automaton must be
constructed and minimized a priori.

Our approach avoids the exponential transformation and does not require
any unnecessary a priori work: we start from an initial adaptive
non-deterministic finite-state automaton and let it process a text sample;
whenever the text being analyzed activates a non-deterministic transition, the
execution of the corresponding adaptive function performs the topological
transformations required to render it deterministic. In this way, our approach
uses an incremental strategy that forces the automaton to perform
deterministically all the transitions it actually executes, without changing
the remaining non-deterministic transitions present in its formulation.

5.1 An Exact String Matching Non-deterministic FSA 

Here we follow [Hol00]; the straightforward non-deterministic finite-state
automaton is constructed for accepting the pattern aba over the alphabet
$\Sigma = \{a, b\}$:






Fig. 1. A non-deterministic finite-state automaton 




Fig. 2. Eliminating non-deterministic transitions. 



5.2 An Equivalent Adaptive FSA for Exact String Matching 

Now let us turn our attention to our adaptive approach. In Fig. 2 (left) a
non-determinism is present. In order to remove it, we introduce a new state
(Fig. 2, right).

Next, all states that were reachable through the transitions departing from
the conflicting states must be made reachable from the newly created state. To
this end, we add further transitions leaving the new state and arriving at those
target states, consuming the appropriate symbols, as shown in Fig. 3. These
operations may be sketched as an adaptive function B, shown in Fig. 4.

This adaptive function receives a state and a token as parameters. In this 
formulation, it declares three variables and one generator; in line 3 quantifiers are 
used to allow testing whether there is more than one transition departing from 

Fig. 3. Inserting transitions connecting the new state to the automaton



1  B(q, σ) = {                                    declaring name and parameters of B
2    p, n*, r, a:                                 declaring variables (n is a generator)
3    (!p)(?[(q, σ) → p]) {                        only if there is more than one rule with this shape,
4      +[(q, σ) → n]                              add a new transition with this shape with destination n
5      (∀p)(?[(q, σ) → p]) {                      for all transitions emerging from state q
6        (∀a)(∀r)(?[(p, a) → r]) {+[(n, a) → r]}  insert corresponding transitions departing from n
7        -[(q, σ) → p] }}}                        after all insertions, remove the original transition

Fig. 4. Adaptive function B that dynamically eliminates non-deterministic transitions
from our non-deterministic adaptive finite-state automaton



the state received as the first parameter and consuming the symbol received
as the second parameter; if the answer is negative, then the query (the clause
introduced by a question mark in this notation) produces an empty result,
and in this case the clause in braces (comprising lines 4 to 7) is not
executed. Otherwise, in line 4 a new transition is created from the current
state to a new one, by means of generator n. Lines 5 to 7 state that, for each
outgoing transition, a corresponding outgoing transition is generated from the
new state, and that the original transition is then deleted. The resulting
adaptive non-deterministic finite-state automaton is shown in Fig. 5.






Fig. 5. Adaptive non-deterministic finite-state automaton 



6 Conclusions 

The proposed simplification of the formulation of adaptive automata seems to
be reasonably expressive, compact and readable, allowing adaptive automata to
be stated in a rather intuitive form. Most features and development practices
already established are preserved and respected to a large extent.

The proposed formulation has almost no impact on the power of adaptive
automata, so the net result of its use is a significant increase in the
readability and soundness of the formulation without loss of the device's
computational power.

From the programming point of view, our proposal also has advantages over
the earlier notations, allowing programmers to build and debug adaptive
automata more expeditiously, resulting in better products and far better
documentation.

Abstract. This paper studies practical algorithms for dealing with a
particular family of cellular automata, which has recently been proved 
computationally equivalent to linear conjunctive grammars. The relation 
between these grammars and these automata resembles that between reg- 
ular expressions and finite automata: while the former are better suited 
for human use, the latter are considerably easier to implement. In this 
paper, an algorithm for converting an arbitrary linear conjunctive gram- 
mar to an equivalent automaton is proposed, and different techniques of 
reducing the size of existing automata are studied. 



1 Introduction 

Linear conjunctive grammars [2] are linear context-free grammars augmented
with an explicit intersection operation. The language family they generate is
closed under all set-theoretic operations jS] and is known to contain many
classical non-context-free languages, such as $\{a^n b^n c^n \mid n \geq 0\}$,
$\{wcw \mid w \in \{a,b\}^*\}$ and the language of all derivations in a given
finite string rewriting system [^, as well as other interesting languages, such
as {ba^ba^b . . . | n ≥ 0}.

In spite of their increased generative capacity, linear conjunctive grammars
can be parsed with virtually the same quadratic-time methods as linear
context-free grammars j^. In an attempt to generalize these recognition methods,
the paper [1] investigated a certain simple family of automata and established
their computational equivalence to linear conjunctive grammars. These automata
are basically one-dimensional cellular automata with the tape gradually
shrinking in the course of the computation; they can be simulated in
$(n^2 + n)/2 + C$ elementary table lookup operations, which makes them very
suitable for practical use. However, the grammar-to-automaton construction
methods of [1] were mainly aimed at establishing the computational equivalence
of the two formalisms and are very inefficient in terms of the number of states
generated; additionally, they require the grammar to be in a certain normal
form, which is also inconvenient.

In this paper we develop a new practical algorithm to construct an automaton
out of an arbitrary grammar and study various techniques for reducing the
number of states in these automata. The language $\{wcw \mid w \in \{a, b\}^*\}$
is used to illustrate the methods of this paper, which convert the original
grammar for this language [2] to an automaton with 158 states and then reduce
this automaton to 35 states, while the methods of [1] yield 222 states.

The algorithms developed in this paper have been implemented in the parser
generator for conjunctive grammars jS].

2 Preliminaries 

2.1 Linear Conjunctive Grammars 

Conjunctive grammars were introduced in [2] as an extension of context-free
grammars with an explicit intersection operation.

Definition 1. A conjunctive grammar is a quadruple $G = (\Sigma, N, P, S)$, where
$\Sigma$ and $N$ are disjoint finite nonempty sets of terminal and nonterminal
symbols; $P$ is a finite set of grammar rules of the form

$A \to \alpha_1 \& \ldots \& \alpha_n$   ($A \in N$; $n \geq 1$; for all $i$, $\alpha_i \in (\Sigma \cup N)^*$),   (1)

where the strings $\alpha_i$ are distinct and their order is considered
insignificant; $S \in N$ is a nonterminal designated as the start symbol.

For each rule of the form (1) and for each $i$ ($1 \leq i \leq n$), $A \to \alpha_i$
is called a conjunct. Let conjuncts(P) denote the set of all conjuncts.

A conjunctive grammar generates strings by deriving them from the start 
symbol, generally in the same way as the context-free grammars do. Intermediate 
strings used in course of a derivation are defined as follows: 

Definition 2. Let $G = (\Sigma, N, P, S)$ be a conjunctive grammar. The set of
conjunctive formulae $\mathcal{F} \subseteq (\Sigma \cup N \cup \{\text{`('}, \text{`\&'}, \text{`)'}\})^*$
is defined inductively: (i) The empty string $\varepsilon$ is a formula; (ii) Any
symbol from $\Sigma \cup N$ is a formula; (iii) If $\mathcal{A}$ and $\mathcal{B}$
are nonempty formulae, then $\mathcal{A}\mathcal{B}$ is a formula; (iv) If
$\mathcal{A}_1, \ldots, \mathcal{A}_n$ ($n \geq 1$) are formulae, then
$(\mathcal{A}_1 \& \ldots \& \mathcal{A}_n)$ is a formula.

There are two types of derivation steps:

1. A nonterminal can be rewritten with a body of a rule enclosed in parentheses:
   $s' A s'' \Rightarrow s' (\alpha_1 \& \ldots \& \alpha_n) s''$, if
   $A \to \alpha_1 \& \ldots \& \alpha_n \in P$ and $s' A s'' \in \mathcal{F}$,
   where $s', s'' \in (\Sigma \cup N \cup \{\text{`('}, \text{`\&'}, \text{`)'}\})^*$.

2. A conjunction of one or more identical terminal strings enclosed in
   parentheses can be replaced with one such string without the parentheses:
   $s' (w \& \ldots \& w) s'' \Rightarrow s' w s''$, if
   $s' (w \& \ldots \& w) s'' \in \mathcal{F}$.



Definition 3. Let $G = (\Sigma, N, P, S)$ be a conjunctive grammar. The language
of a formula is the set of all terminal strings derivable from the formula:
$L_G(\mathcal{A}) = \{w \in \Sigma^* \mid \mathcal{A} \Rightarrow^* w\}$. Define $L(G) = L_G(S)$.

Let us now restrict general conjunctive grammars to obtain the subclass of
linear conjunctive grammars:

Definition 4. A conjunctive grammar $G = (\Sigma, N, P, S)$ is said to be linear, if
each rule in $P$ is of the form

$A \to u_1 B_1 v_1 \& \ldots \& u_m B_m v_m$   ($m \geq 1$; $u_i, v_i \in \Sigma^*$; $B_i \in N$)   (2a)
$A \to w$   ($w \in \Sigma^*$)   (2b)

It has been proved that every linear conjunctive grammar can be effectively
transformed to an equivalent grammar of the following form:

Definition 5. A linear conjunctive grammar $G = (\Sigma, N, P, S)$ is said to be in
the linear normal form, if each rule in $P$ is of the form

$A \to bB_1 \& \ldots \& bB_m \& C_1 c \& \ldots \& C_n c$   ($m, n \geq 0$; $m + n \geq 1$; $B_i, C_j \in N$; $b, c \in \Sigma$),

$A \to a$   ($A \in N$, $a \in \Sigma$),

$S \to \varepsilon$, only if $S$ does not appear in right parts of rules.

Example 1. The following linear conjunctive grammar for the well-known
non-context-free language $\{wcw \mid w \in \{a, b\}^*\}$, quoted from [2], will
be used as an example throughout this paper:

S → C & D                                  (3a)-(3c)
C → aCa | aCb | bCa | bCb | c
D → aA & aD | bB & bD | cE
A → aAa | aAb | bAa | bAb | cEa
B → aBa | aBb | bBa | bBb | cEb
E → aE | bE | ε

2.2 The Corresponding Automata 

Let us give a definition of a particular family of one-dimensional cellular au- 
tomata, which is known to be equivalent to linear conjunctive grammars [1]. 

Definition 6. An automaton is a quintuple $M = (\Sigma, Q, I, \delta, F)$, where
$\Sigma$ is the input alphabet, $Q$ is a finite nonempty set of states,
$I : \Sigma \to Q$ is a function that sets the initial states,
$\delta : Q \times Q \to Q$ (a binary operator on $Q$) is the transition
function, and $F \subseteq Q$ is the set of final states.

An automaton $M = (\Sigma, Q, I, \delta, F)$ takes a nonempty string
$w = a_1 \ldots a_n$ ($a_i \in \Sigma$, $n \geq 1$) as an input, converts it to the
string of states $I(a_1) \ldots I(a_n)$ (see Figure 1(a)) and then proceeds to
construct new strings of states out of existing strings of states by replacing a
string of the form $q_1 \ldots q_m$ with the string
$\delta(q_1, q_2), \delta(q_2, q_3), \ldots, \delta(q_{m-1}, q_m)$, as shown in
Figure 1(b). This is done until the string of states shrinks to a single state;
then the string is accepted if and only if this single state belongs to the set $F$.

Now let us formally define the computation of an automaton. 

Fig. 1. One step of computation. Panel (a): the input converted to the string of
states; panel (b): a string of states replaced by its successor.



Definition 7. An instantaneous description (ID) of an automaton
$(\Sigma, Q, I, \delta, F)$ is an arbitrary nonempty string over $Q$. The
successor of an ID $q_1 q_2 \ldots q_n$ (where $n \geq 2$), denoted as
$\delta(q_1 q_2 \ldots q_n)$, is the ID $q'_1 q'_2 \ldots q'_{n-1}$ such that
$q'_i = \delta(q_i, q_{i+1})$ for all $i$. A sequence of IDs
$\alpha_1, \ldots, \alpha_n$ ($n \geq 1$) is called a computation of the automaton
if $\alpha_{i+1} = \delta(\alpha_i)$ for all $i$ ($1 \leq i < n$) and $|\alpha_n| = 1$.

Note that the successor of an ID is uniquely determined, and therefore the 
automaton is deterministic and the computation starting from a definite string 
of states has a definite outcome: 

Definition 8. For each ID $q_1 \ldots q_n$ ($n \geq 1$), denote the outcome of the
computation starting from $q_1 \ldots q_n$ as
$\Delta(q_1 \ldots q_n) = \delta^{n-1}(q_1 \ldots q_n) \in Q$.

It is left to define the initial ID of the automaton on the given input: 

Definition 9. Let $M = (\Sigma, Q, I, \delta, F)$ be an automaton. For each string
$w = a_1 \ldots a_n \in \Sigma^+$, define $I(w) = I(a_1) \ldots I(a_n)$. The
computation of the automaton $M$ on the string $w$ is the computation starting
from the ID $I(w)$. The string $w$ is accepted iff $\Delta(I(w)) \in F$; the
language accepted by the automaton is defined as
$L(M) = \{w \mid w \in \Sigma^+, \Delta(I(w)) \in F\}$.
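
The computation defined above can be simulated directly. The following is a
minimal sketch, assuming the tables I and δ are stored as Python dictionaries;
the names `outcome` and `accepts` and the toy automaton (which accepts strings
over {a, b} whose first and last symbols coincide) are our own illustration and
do not come from the paper.

```python
from itertools import product


def outcome(initial, delta, word):
    """Return (Delta(I(word)), number of table lookups performed)."""
    if not word:
        raise ValueError("the automaton is defined on nonempty strings only")
    states = [initial[a] for a in word]          # the initial ID I(w)
    lookups = len(states)                        # one lookup in I per symbol
    while len(states) > 1:
        # successor ID: apply delta to every pair of neighbouring states
        states = [delta[(p, q)] for p, q in zip(states, states[1:])]
        lookups += len(states)
    return states[0], lookups


def accepts(initial, delta, final, word):
    state, _ = outcome(initial, delta, word)
    return state in final


# Hypothetical toy automaton: a state (x, y) records the first and last symbol.
SIGMA = "ab"
Q = set(product(SIGMA, repeat=2))
INITIAL = {a: (a, a) for a in SIGMA}
DELTA = {(p, q): (p[0], q[1]) for p in Q for q in Q}
FINAL = {(a, a) for a in SIGMA}
```

On an input of length n the loop performs n lookups in I and
(n - 1) + ... + 1 lookups in δ, that is (n² + n)/2 lookups in total, in line with
the bound quoted in the introduction.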

One evident limitation of these automata is their inability to accept or reject 
the empty string; however, this is only a technical limitation which does not 
affect their generative power on longer strings. 

Let us quote two theoretical results obtained in [1] , which show computational 
equivalence of linear conjunctive grammars and these automata. 

Theorem 1. For every linear conjunctive grammar $G = (\Sigma, N, P, S)$ there
exists and can be effectively constructed an automaton
$M = (\Sigma, Q, I, \delta, F)$, such that $L(M) = L(G) \pmod{\Sigma^+}$.

Theorem 2. For every automaton M there exists and can be effectively con- 
structed a linear conjunctive grammar G, such that L{G) = L{M). 



3 Construction of Automata 

In this section we review the construction that was used in [3j in the proof of 
the computational equivalence of automata and grammars, and then generalize 
this construction to obtain a practical method of converting an arbitrary linear 
conjunctive grammar to an equivalent automaton. 

3.1 A Method for Grammars in the Linear Normal Form

The idea of the construction is to simulate the recognition algorithm for linear
conjunctive grammars in the linear normal form developed in [2]. In order to
simulate the computation of this algorithm, each state of the automaton
“remembers” the set of nonterminals that derive the corresponding substring, as
well as the first and the last symbols of this substring.

Let $G = (\Sigma, N, P, S)$ be a linear conjunctive grammar in the linear normal
form. We construct the automaton $M = M(G) = (\Sigma, Q, I, \delta, F)$, where
$Q = \Sigma \times 2^N \times \Sigma$, $I(a) = (a, \{A \mid A \to a \in P\}, a)$,
$\delta((b, R_1, b'), (c', R_2, c)) = (b, \{A \mid \exists A \to bB_1 \& \ldots \& bB_m \& C_1 c \& \ldots \& C_n c \in P : \forall i, j\ B_i \in R_2, C_j \in R_1\}, c)$
and $F = \{(a, R, b) \mid a, b \in \Sigma, R \subseteq N, S \in R\}$.

The correctness of this construction is stated in the following lemma [1]:

Lemma 1. Let $w \in \Sigma^+$ be an arbitrary nonempty string and let
$\Delta(I(w)) = (b, R, c)$. Then, for each nonterminal $A \in N$,
$A \Rightarrow^* w$ if and only if $A \in R$.

A practical implementation of this construction can define $Q$ to be the minimal
subset of $\Sigma \times 2^N \times \Sigma$ that contains $I(a)$ for all
$a \in \Sigma$ and $\delta(q', q'')$ for every two states $q', q''$ in this set.
This can be computed as follows:

let Q = {I(a) | a ∈ Σ}            /* use I as defined above */
while new states can be added to Q
    for all q', q'' ∈ Q
        add δ(q', q'') to Q       /* use δ as defined above */
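
The loop above can be written as a worklist computation. The sketch below is an
illustration under our own assumptions: `initial` and `delta` are passed as
ordinary Python functions, and the toy instance (integers with addition modulo
6) is hypothetical and only serves to exercise the closure.

```python
def reachable_closure(alphabet, initial, delta):
    """Least set containing initial(a) for every symbol a and closed under delta."""
    states = {initial(a) for a in alphabet}
    frontier = set(states)                # states whose pairs are not yet examined
    while frontier:
        new = set()
        for p in states:
            for q in frontier:
                # every pair involving a new state, in both orders
                for r in (delta(p, q), delta(q, p)):
                    if r not in states:
                        new.add(r)
        states |= new
        frontier = new
    return states


# Hypothetical toy instance: I(a) = 1, I(b) = 2, delta is addition modulo 6.
toy_states = reachable_closure("ab", {"a": 1, "b": 2}.get, lambda p, q: (p + q) % 6)
```

Only pairs that involve at least one newly added state are examined in each
round, so no pair of old states is recomputed; the result is the same set Q as
the one produced by the naive loop.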

As we shall show later in Section 4, this construction can still create many
superfluous states, but in practical cases it yields far fewer states than the
$|\Sigma|^2 \cdot 2^{|N|}$ given by the straightforward construction.

For instance, the grammar from Example 1, after being transformed to the
linear normal form, has 13 nonterminals, which gives an upper bound of
$3 \cdot 2^{13} \cdot 3 = 73728$ states, meaning a table $\delta$ of more than
five billion entries. On the other hand, the given construction method yields
only 222 states and, consequently, a table $\delta$ of around fifty thousand
entries.
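
The sizes quoted above can be checked by a short calculation, assuming one entry
of the table δ per ordered pair of states:

```latex
\begin{align*}
|\Sigma| \cdot 2^{|N|} \cdot |\Sigma| &= 3 \cdot 2^{13} \cdot 3 = 9 \cdot 8192 = 73728, \\
73728^2 &= 5\,435\,817\,984 \quad (\text{more than five billion entries}), \\
222^2 &= 49\,284 \quad (\text{around fifty thousand entries}).
\end{align*}
```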

3.2 A Method for Arbitrary Grammars 

The goal of the construction is again an automaton that is able to compute
the sets of nonterminals deriving each substring. However, the conjuncts can
now be of the form $A \to uBv$ or $A \to w$ for strings $u$, $v$ and $w$ of
unbounded length, while the automata still operate only on pairs of states that
are one input symbol apart, and we still have to compute
$\Delta(I(a_1 \ldots a_n))$ out of $\Delta(I(a_1 \ldots a_{n-1}))$ and
$\Delta(I(a_2 \ldots a_n))$ only.

The proposed solution is to fix some base set of strings over $\Sigma \cup N$
that we are interested in, such that the subset of those that derive some
terminal string $bwc$ can be determined solely by the symbols $b$, $c$ and by
the knowledge of which strings from this base set derive the terminal strings
$bw$ and $wc$. Once this collection is constructed, this functional dependency
can be encoded in the definition of the transition function $\delta$.

The simplest base set satisfying this condition is the set of all nonempty
substrings of $\{\alpha \mid A \to \alpha \in conjuncts(P)\}$, but it in fact
contains numerous superfluous strings. For instance, if we want to recognize the
string bac, then keeping track of both ba and ac is not needed. We shall now
propose one possible more efficient method of constructing the base set of
strings. Given an arbitrary linear conjunctive grammar $G = (\Sigma, N, P, S)$,
let us define the set of incomplete strings
$Items \subseteq \Sigma^+ \cup \Sigma^* N \Sigma^*$ as
$Items = \{S\} \cup Items_{uB} \cup Items_{uBv} \cup Items_w$, where

$Items_{uB} = \bigcup_{A \to uBv \in conjuncts(P)} \{yB \mid \exists x \in \Sigma^* : xy = u,\ |yB| < |uBv|\}$,   (4a)

$Items_{uBv} = \bigcup_{A \to uBv \in conjuncts(P)} \{uBx \mid \exists y \in \Sigma^* : xy = v,\ |uBx| < |uBv|\}$,   (4b)

$Items_w = \bigcup_{A \to w \in conjuncts(P)} \{x \mid \exists y \in \Sigma^* : xy = w,\ |x| < |w|\}$   (4c)

Let us also define the following augmented version of this set:
$\widehat{Items} = Items \cup \{\alpha \mid \alpha \neq \varepsilon,\ A \to \alpha \in conjuncts(P) \text{ for some } A \in N\}$.
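
As an illustration, the set Items can be computed mechanically for the grammar
of Example 1. The sketch below is our own reading of (4a)-(4c) and rests on
several assumptions: nonterminals are single upper-case letters, every conjunct
body is a Python string, the empty string is written "" and is never added
(since Items contains no empty string), and the names `GRAMMAR` and `items` are
hypothetical.

```python
# Example 1: each rule is a list of conjunct bodies; "" stands for the empty string.
GRAMMAR = {
    "S": [["C", "D"]],
    "C": [["aCa"], ["aCb"], ["bCa"], ["bCb"], ["c"]],
    "D": [["aA", "aD"], ["bB", "bD"], ["cE"]],
    "A": [["aAa"], ["aAb"], ["bAa"], ["bAb"], ["cEa"]],
    "B": [["aBa"], ["aBb"], ["bBa"], ["bBb"], ["cEb"]],
    "E": [["aE"], ["bE"], [""]],
}


def items(grammar, start="S"):
    nonterminals = set(grammar)
    result = {start}
    for rules in grammar.values():
        for rule in rules:
            for body in rule:
                pos = next((i for i, s in enumerate(body) if s in nonterminals), None)
                if pos is None:
                    # conjunct A -> w: nonempty proper prefixes of w, as in (4c)
                    result.update(body[:k] for k in range(1, len(body)))
                    continue
                u, b, v = body[:pos], body[pos], body[pos + 1:]
                for k in range(len(u) + 1):          # (4a): suffix of u, then B
                    if len(u[k:] + b) < len(body):
                        result.add(u[k:] + b)
                for k in range(len(v) + 1):          # (4b): uB, then prefix of v
                    if len(u + b + v[:k]) < len(body):
                        result.add(u + b + v[:k])
    return result
```

Under these assumptions `items(GRAMMAR)` yields 13 strings, the same set that
the text lists for this grammar further on.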

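Lemma 2. Let $a \in \Sigma$. Denote $Nullable = \{A \mid A \in N, A \Rightarrow^* \varepsilon\}$. Then,

$(\{a\} \cup \{a\} \cdot Nullable \cup Nullable \cdot \{a\}) \cap Items = \{\alpha \mid \alpha \in Items,\ \alpha \Rightarrow^* a,\ \alpha \notin N\}$   (5)

Lemma 3. Let $bwc$ ($b, c \in \Sigma$, $w \in \Sigma^*$) be an arbitrary string of
length 2 or more. Let $X_1 = \{\alpha \mid \alpha \in Items, \alpha \Rightarrow^* bw\}$,
$X_2 = \{\alpha \mid \alpha \in Items, \alpha \Rightarrow^* wc\}$. Then,

$(\{b\} \cdot X_2 \cup X_1 \cdot \{c\}) \cap Items = \{\alpha \mid \alpha \in Items,\ \alpha \Rightarrow^* bwc,\ \alpha \notin N\}$   (6)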

Once the set (6) is computed, the set of nonterminals that derive the string
$bwc$ can be determined by the following closure operation:

Definition 10. Let $G = (\Sigma, N, P, S)$ be a linear conjunctive grammar, let
$R \subseteq (\Sigma \cup N)^+$. Define the set closure(R) as the least subset of
$R \cup N$ that includes $R$ and contains every $A \in N$ such that there is a
rule $A \to \alpha_1 \& \ldots \& \alpha_m \in P$, for which
$\alpha_i \in closure(R)$ for all $i$.
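
The closure operation of Definition 10 is a least fixed point and can be
computed by simple iteration. The following sketch uses the grammar of Example 1;
the encoding of rules as lists of conjunct bodies and the names `GRAMMAR` and
`closure` are our own assumptions, made for illustration only.

```python
# Example 1: each rule is a list of conjunct bodies; "" stands for the empty string.
GRAMMAR = {
    "S": [["C", "D"]],
    "C": [["aCa"], ["aCb"], ["bCa"], ["bCb"], ["c"]],
    "D": [["aA", "aD"], ["bB", "bD"], ["cE"]],
    "A": [["aAa"], ["aAb"], ["bAa"], ["bAb"], ["cEa"]],
    "B": [["aBa"], ["aBb"], ["bBa"], ["bBb"], ["cEb"]],
    "E": [["aE"], ["bE"], [""]],
}


def closure(strings, grammar):
    """Add every nonterminal A that has a rule whose conjunct bodies are all present."""
    result = set(strings)
    changed = True
    while changed:
        changed = False
        for nonterminal, rules in grammar.items():
            if nonterminal in result:
                continue
            if any(all(body in result for body in rule) for rule in rules):
                result.add(nonterminal)
                changed = True
    return result
```

For example, starting from {C, aA, aD} the first pass adds D (by the rule
D → aA & aD) and the second pass adds S (by the rule S → C & D), after which
nothing changes.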



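Lemma 4. If $X \cap Items = \{\alpha \mid \alpha \in Items,\ \alpha \Rightarrow^* w,\ \alpha \notin N\}$
for some $w \in \Sigma^+$ and $X \subseteq (\Sigma \cup N)^+$, then

$closure(X) \cap Items = \{\alpha \mid \alpha \in Items,\ \alpha \Rightarrow^* w\}$   (7)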

Now it suffices to write down the mentioned results in the form of a
construction. Define the new automaton as $(\Sigma, Q, I, \delta, F)$, where
$Q = \Sigma \times 2^{Items} \times \Sigma$, the initial states are defined as

$I(a) = (a,\ closure(\{a\} \cup \{a\} \cdot Nullable \cup Nullable \cdot \{a\}) \cap Items,\ a)$   (8)

for all $a \in \Sigma$, the transition function $\delta$ is defined as

$\delta((b, X_1, c'), (b', X_2, c)) = (b,\ closure(b \cdot X_2 \cup X_1 \cdot c) \cap Items,\ c)$   (9)

for every $(b, X_1, c'), (b', X_2, c) \in Q$, and the set of final states is

$F = \{(b, X, c) \mid (b, X, c) \in Q,\ S \in X\}$   (10)

Lemma 5. For each string $w \in \Sigma^+$, let $\Delta(I(w)) = (b, R, c)$. Then, for
every incomplete string $\alpha \in Items$, $\alpha \Rightarrow^* w$ if and only if
$\alpha \in R$.

Consider the grammar for $\{wcw \mid w \in \{a, b\}^*\}$ from Example 1. The set
Items equals {S, C, D, A, B, E, aC, aA, aB, bC, bA, bB, cE}. The cardinality of
$\Sigma \times 2^{Items} \times \Sigma$ is again $3 \cdot 2^{13} \cdot 3 = 73728$.
However, if we rule out obviously unreachable states in the same way as
suggested in Section 3.1, the algorithm will construct the following 158 states:



(a, ∅, a)  (a, {C,E,aA}, a)  (b, {D,A,bB}, a)  (a, {A,E,aC}, b)
(a, ∅, b)  (a, {C,E,aA}, b)  (b, {D,A,bB}, b)  (a, {A,E,aA}, a)
(a, ∅, c)  (a, {C,E,aB}, a)  (c, {D,A,cE}, a)  (a, {A,E,aA}, b)
(b, ∅, a)  (a, {C,E,aB}, b)  (a, {D,B,aA}, a)  (a, {A,E,aB}, a)
(b, ∅, b)  (b, {C,E,bC}, a)  (a, {D,B,aA}, b)  (a, {A,E,aB}, b)
(b, ∅, c)  (b, {C,E,bC}, b)  (b, {D,B,bB}, a)  (b, {A,E,bC}, a)
(c, ∅, a)  (b, {C,E,bA}, a)  (b, {D,B,bB}, b)  (b, {A,E,bC}, b)
(c, ∅, b)  (b, {C,E,bA}, b)  (c, {D,B,cE}, b)  (b, {A,E,bA}, a)
(c, ∅, c)  (b, {C,E,bB}, a)  (a, {D,aA}, a)  (b, {A,E,bA}, b)
(a, {S,C,D,aA}, a)  (b, {C,E,bB}, b)  (a, {D,aA}, b)  (b, {A,E,bB}, a)
(a, {S,C,D,aA}, b)  (a, {C,aC}, a)  (b, {D,bB}, a)  (b, {A,E,bB}, b)
(b, {S,C,D,bB}, a)  (a, {C,aC}, b)  (b, {D,bB}, b)  (a, {A,aC}, a)
(b, {S,C,D,bB}, b)  (a, {C,aA}, a)  (c, {D,cE}, a)  (a, {A,aC}, b)
(c, {S,C,D,cE}, c)  (a, {C,aA}, b)  (c, {D,cE}, b)  (a, {A,aA}, a)
(a, {C}, a)  (a, {C,aB}, a)  (a, {A}, a)  (a, {A,aA}, b)
(a, {C}, b)  (a, {C,aB}, b)  (a, {A}, b)  (a, {A,aB}, a)
(b, {C}, a)  (b, {C,bC}, a)  (b, {A}, a)  (a, {A,aB}, b)
(b, {C}, b)  (b, {C,bC}, b)  (b, {A}, b)  (b, {A,bC}, a)
(a, {C,E}, a)  (b, {C,bA}, a)  (c, {A}, a)  (b, {A,bC}, b)
(a, {C,E}, b)  (b, {C,bA}, b)  (a, {A,E}, a)  (b, {A,bA}, a)
(b, {C,E}, a)  (b, {C,bB}, a)  (a, {A,E}, b)  (b, {A,bA}, b)
(b, {C,E}, b)  (b, {C,bB}, b)  (b, {A,E}, a)  (b, {A,bB}, a)
(a, {C,E,aC}, a)  (a, {D,A,aA}, a)  (b, {A,E}, b)  (b, {A,bB}, b)
(a, {C,E,aC}, b)  (a, {D,A,aA}, b)  (a, {A,E,aC}, a)  (a, {B}, a)
(a, {B}, b)  (b, {B,E,bA}, a)  (a, {E}, a)  (a, {aC}, a)
(b, {B}, a)  (b, {B,E,bA}, b)  (a, {E}, b)  (a, {aC}, b)
(b, {B}, b)  (b, {B,E,bB}, a)  (b, {E}, a)  (a, {aC}, c)
(c, {B}, b)  (b, {B,E,bB}, b)  (b, {E}, b)  (a, {aA}, a)
(a, {B,E}, a)  (a, {B,aC}, a)  (a, {E,aC}, a)  (a, {aA}, b)
(a, {B,E}, b)  (a, {B,aC}, b)  (a, {E,aC}, b)  (a, {aB}, a)
(b, {B,E}, a)  (a, {B,aA}, a)  (a, {E,aA}, a)  (a, {aB}, b)
(b, {B,E}, b)  (a, {B,aA}, b)  (a, {E,aA}, b)  (b, {bC}, a)
(a, {B,E,aC}, a)  (a, {B,aB}, a)  (a, {E,aB}, a)  (b, {bC}, b)
(a, {B,E,aC}, b)  (a, {B,aB}, b)  (a, {E,aB}, b)  (b, {bC}, c)
(a, {B,E,aA}, a)  (b, {B,bC}, a)  (b, {E,bC}, a)  (b, {bA}, a)
(a, {B,E,aA}, b)  (b, {B,bC}, b)  (b, {E,bC}, b)  (b, {bA}, b)
(a, {B,E,aB}, a)  (b, {B,bA}, a)  (b, {E,bA}, a)  (b, {bB}, a)
(a, {B,E,aB}, b)  (b, {B,bA}, b)  (b, {E,bA}, b)  (b, {bB}, b)
(b, {B,E,bC}, a)  (b, {B,bB}, a)  (b, {E,bB}, a)  (b, {B,E,bC}, b)
(b, {B,bB}, b)  (b, {E,bB}, b)






4 Reduction of the Automata 

4.1 On the Minimization of Automata 

Although it is not known whether the minimal automaton for every linear
conjunctive language is unique, from the practical point of view it might make
sense to look for one of the automata that generate a particular language using
a minimal number of states. However, this cannot be done algorithmically; we
cannot even compute this minimal size:

Theorem 3. There is no algorithm to compute the minimal number of states
in the automata for a given linear conjunctive language.

This result follows from the undecidability of the emptiness problem for
linear conjunctive grammars [3], and from the fact that the minimal number of
states in automata for an arbitrary language $L \subseteq \Sigma^+$ is 1 if and
only if $L = \emptyset$ or $L = \Sigma^+$.

Corollary 1. There is no algorithm to construct one of the minimal automata
for a given grammar.

These negative results show that there is no general method to reduce the size
of arbitrary given automata. However, the automata that are used in practice
(i.e., those that one would generally need to reduce) are usually not
deliberately obfuscated, and thus the practical case does not much resemble
solving the emptiness problem. It turns out that many automata (for instance,
most of those constructed by the algorithms given in Section 3) are subject to
substantial reduction by quite simple methods. Let us discuss some of these
methods.

4.2 Reachable States 

A state $q \in Q$ is called reachable if there exists a string $w \in \Sigma^+$
such that $\Delta(I(w)) = q$. Obviously, unreachable states are of no use and
could be safely removed from an automaton to save space. Unfortunately, by the
same reasoning as above, there is no algorithm to determine them:

Observation 1. An automaton $(\Sigma, Q, I, \delta, F)$ generates the empty
language if and only if every state in $F$ is unreachable. It is undecidable
whether a given state of an arbitrary given automaton is reachable.

However, let us turn to the practical case. The construction methods described
above are based on the naive assumption that if two states $q'$ and $q''$ are
reachable, then $\delta(q', q'')$ is also a reachable state. This assumption is,
in general, false, because $q'$ and $q''$ could be reachable through completely
different computations and thus may never appear next to each other, so that
the transition $\delta(q', q'')$ is never performed. Therefore, these algorithms
are almost certain to create a lot of superfluous states, some of which are so
“unrealistic” that they can be filtered out by quite simple necessary conditions
of reachability. Let us consider some of these conditions.

1. State filtering. Fix an integer $k \geq 2$, called the order of the method.
   The main idea is that “if some states $q_{i_1}, \ldots, q_{i_k}$
   ($q_{i_j} \in Q$) are supposed to be reachable, then the state
   $\Delta(q_{i_1} \ldots q_{i_k})$ can also be assumed to be reachable”. Of
   course, this is not always true, because the fact that every state $q_{i_j}$
   is reachable in itself does not guarantee that they could ever appear next to
   each other in some computation. Still, this method allows us to determine
   some obviously unreachable states.

   - For every nonempty terminal string $w \in \Sigma^+$ of length strictly less
     than $k$ (i.e., $0 < |w| < k$), declare the state $\Delta(I(w))$ to be
     reachable.

   - For every string $q^{(1)} \ldots q^{(k)}$ of $k$ supposedly reachable
     states, add $\Delta(q^{(1)} \ldots q^{(k)})$ to the list of reachable
     states.

   Let us note that the algorithms in Section 3 implicitly apply order 2 state
   filtering to the sets of states $\Sigma \times 2^N \times \Sigma$ and
   $\Sigma \times 2^{Items} \times \Sigma$.

2. Tuple filtering. Fix an integer $n \geq 1$, the size of the tuples being
   considered. Our task is to determine the set $R \subseteq Q^n$ of all
   n-tuples of states that have a chance to be reachable.

   - For all n-tuples of terminal symbols $(a_1, \ldots, a_n) \in \Sigma^n$,
     declare the tuple $(I(a_1), \ldots, I(a_n))$ to be reachable.

   - For every two supposedly reachable n-tuples $(q'_1, \ldots, q'_n)$,
     $(q''_1, \ldots, q''_n)$ such that $q'_{j+1} = q''_j$ ($1 \leq j < n$),
     consider the n-tuple
     $\delta(q'_1 q'_2 \ldots q'_n q''_n) = (\delta(q'_1, q'_2), \ldots, \delta(q'_n, q''_n))$
     to be reachable.

3. Tuple filtering of higher order. Let us now consider the generalized
   approach that combines the ideas of tuple filtering with higher-order state
   filtering. Let $k \geq 2$ and $n \geq 1$. Order $k$ filtering of n-tuples (or
   (k, n)-filtering for short) traces partial computations from $k$ overlapped
   n-tuples to a single n-tuple:

   - For all $(n + l)$-tuples ($l < k - 1$) of symbols $(a_1, \ldots, a_{n+l})$
     ($a_i \in \Sigma$), declare the n-tuple
     $\delta^l(I(a_1), \ldots, I(a_{n+l}))$ to be reachable.

   - For every $k$ supposedly reachable n-tuples
     $(q^{(i)}_1, \ldots, q^{(i)}_n)$ ($1 \leq i \leq k$) such that
     $q^{(i)}_{j+1} = q^{(i+1)}_j$ for all $1 \leq i < k$ and $1 \leq j < n$, the
     n-tuple
     $\delta^{k-1}(q^{(1)}_1 \ldots q^{(1)}_n q^{(2)}_n \ldots q^{(k)}_n) = (\Delta(q^{(1)}_1 \ldots q^{(k)}_1), \ldots, \Delta(q^{(1)}_n \ldots q^{(k)}_n))$
     is also considered to be reachable.

It is easy to observe that the earlier mentioned state filtering of order $k$
is in fact (k, 1)-filtering, while n-tuple filtering is (2, n)-filtering.

Turning to the automaton for the language $\{wcw \mid w \in \{a, b\}^*\}$
constructed out of the grammar given in Example 1 by the algorithm from Section
3.2, only 58 of the 158 original states are found to be reachable by
(3, 2)-filtering, and brute-force tests reveal that each of these 58 states is
actually used in some real computation.



4.3 Equivalent States 

In this section we generalize the known technique of finite automaton minimiza- 
tion [T] for the case of our automata; that is splitting the set of states into 
classes of equivalence. Although in our case it will not necessarily yield the min- 
imal automaton, it can nevertheless be quite successful in reducing the number 
of states. 

Given an automaton $M = (\Sigma, Q, I, \delta, F)$, we are looking for a partition
$Q = Q_1 \cup \ldots \cup Q_m$ ($Q_i \cap Q_j = \emptyset$ for all $i \neq j$) of
the set of states, such that every class of equivalence is either a subset of
$F$ or a subset of $Q \setminus F$, and for every two classes $Q_i$ and $Q_j$
there exists a class $Q_k$ such that for all $q' \in Q_i$ and $q'' \in Q_j$ it
holds that $\delta(q', q'') \in Q_k$.

The algorithm for splitting the set of states of our automata into classes
of equivalence can be loosely described as a two-dimensional generalization of
the unary case of the algorithm for finite automata. The set of states is
initially partitioned into accepting and nonaccepting states. Then the algorithm
considers all pairs of classes of equivalence, checking for each pair whether
all possible transitions from the states of one class by the states of the other
lead to the same third class of equivalence. If not, then one of the two source
classes of equivalence can be split into two or more classes, and the algorithm
does so.

let partition = {F, Q \ F} be the initial partition of the set Q
while it is possible to split the partition
    for all pairs (Q_i, Q_j) ∈ partition × partition
        if there exists q ∈ Q_i such that K = {k | ∃q' ∈ Q_j : δ(q, q') ∈ Q_k}
                has cardinality of more than 1
        {
            let Q_j^(k) = {q' | q' ∈ Q_j, δ(q, q') ∈ Q_k} for all k ∈ K
            partition = (partition \ {Q_j}) ∪ {Q_j^(k)}_{k ∈ K}
        }
        else if there exists q ∈ Q_j such that K = {k | ∃q' ∈ Q_i : δ(q', q) ∈ Q_k}
                has cardinality of more than 1
        {
            let Q_i^(k) = {q' | q' ∈ Q_i, δ(q', q) ∈ Q_k} for all k ∈ K
            partition = (partition \ {Q_i}) ∪ {Q_i^(k)}_{k ∈ K}
        }
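
A partition of the requested form can also be obtained by refining all classes
at once. The sketch below is a Moore-style variant and not the pair-by-pair
splitting order of the pseudocode above: two states stay together only while
they agree on acceptance and on the classes of δ(q, r) and δ(r, q) for every
state r. The name `refine` and the dictionary encoding of δ are assumptions made
for this illustration.

```python
def refine(states, delta, final):
    """Return a dict state -> class index for a partition compatible with delta and F."""
    states = list(states)
    cls = {q: int(q in final) for q in states}       # accepting vs nonaccepting
    while True:
        signature = {
            q: (
                cls[q],
                tuple(cls[delta[(q, r)]] for r in states),   # q as the left argument
                tuple(cls[delta[(r, q)]] for r in states),   # q as the right argument
            )
            for q in states
        }
        ids = {}
        new_cls = {q: ids.setdefault(signature[q], len(ids)) for q in states}
        if len(ids) == len(set(cls.values())):       # no class was split: done
            return new_cls
        cls = new_cls


# Hypothetical toy automaton: states 0..3, delta is addition modulo 4.
TOY_DELTA = {(p, q): (p + q) % 4 for p in range(4) for q in range(4)}
```

With the accepting set {0, 2} the toy automaton collapses to two classes (even
and odd states), whereas with the accepting set {0} no two states can be merged.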



Table 1. Partition of the set of states of a sample automaton.

Class  States of the original automaton            Accessing strings
0      (a, ∅, a), (a, ∅, b)                         acca, accb, aaaca
1      (a, {S,C,D,aA}, a), (a, {S,C,D,aA}, b)       aca, aacaa, abcab
2      (a, {aC}, a), (a, {aC}, b)                   aaca, aacb, abca
3      (b, ∅, a), (b, ∅, b)                         bcca, bccb, baaca
4      (a, {C,aA}, a), (a, {C,aA}, b)               aacab, abcaa, aaacaab
5      (b, {S,C,D,bB}, a), (b, {S,C,D,bB}, b)       bcb, bacba, bbcbb
6      (a, {D,A,aA}, a), (a, {D,A,aA}, b)           acaa, acaaa, acbaa
7      (c, {S,C,D,cE}, c)                           c
8      (a, {D,B,aA}, a), (a, {D,B,aA}, b)           acba, acaba, acbba
9      (b, {bC}, a), (b, {bC}, b)                   baca, bacb, bbca
A      (c, ∅, a), (c, ∅, b)                         cca, ccb, caca
B      (b, {C,bA}, a), (b, {C,bA}, b)               bca, bacaa, bacab
C      (a, {C,aB}, a), (a, {C,aB}, b)               acb, aacba, aacbb
D      (a, ∅, c)                                    aac, abc, acc
E      (a, {aC}, c)                                 ac
F      (b, ∅, c)                                    bac, bbc, bcc
G      (b, {bC}, c)                                 bc
H      (c, ∅, c)                                    cc, cac, cbc
I      (c, {D,A,cE}, a)                             ca, caa, cba
J      (b, {D,A,bB}, a), (b, {D,A,bB}, b)           bcab, bcaab, bcbab
K      (a, {E}, a)                                  a, aa, aaa
L      (a, {E}, b)                                  ab, aab, abb
M      (b, {E}, a)                                  ba, baa, bba
N      (b, {E}, b)                                  b, bb, bab
O      (b, {D,B,bB}, a), (b, {D,B,bB}, b)           bcbb, bcabb, bcbbb
P      (c, {D,B,cE}, b)                             cb, cab, cbb
Q      (a, {A,aA}, a), (a, {A,aA}, b)               aacaab, abcaaa, aacaaab
R      (a, {A,aB}, a), (a, {A,aB}, b)               acab, acaab, acbab
S      (a, {B,aA}, a), (a, {B,aA}, b)               aacbab, abcbaa, aacabab
T      (a, {B,aB}, a), (a, {B,aB}, b)               acbb, acabb, acbbb
U      (b, {C,bB}, a), (b, {C,bB}, b)               bacbb, bbcba, baacbab
V      (b, {A,bA}, a), (b, {A,bA}, b)               bcaa, bcaaa, bcbaa
W      (b, {A,bB}, a), (b, {A,bB}, b)               bacabb, bbcaba, bacaabb
X      (b, {B,bA}, a), (b, {B,bA}, b)               bcba, bcaba, bcbba
Y      (b, {B,bB}, a), (b, {B,bB}, b)               bacbbb, bbcbba, bacabbb


Once a partition can no longer be split, it is of the requested form. Then,
as in the case of DFAs, it suffices to construct a new automaton in which every
state represents a class of states of the original automaton.

Applying this algorithm to the mentioned automaton for the language
$\{wcw \mid w \in \{a, b\}^*\}$, initially of 158 states and then reduced to 58 states by
(3, 2)-filtering, allows one to group these 58 states into 35 equivalence classes.

Table 1 shows the states of the 58-state automaton - triples (terminal, set of
items, terminal) - forming each of these classes, and several accessing strings for
each class. The classes themselves are denoted with boldface digits from 0 to 9
and capital letters from A to Y. This partition can be used to construct a new
automaton of 35 states out of the existing 58-state automaton. The $\delta$ function of
this reduced automaton is given in Table 2; the initial function maps $I(a) = K$,
$I(b) = N$ and $I(c) = 7$; the set of accepting states is $F = \{1, 5, 7\}$.

These $35^2 = 1225$ entries of $\delta$, with a small addition of a 3-entry $I$ table and
35 bits of $F$, look reasonably small for practical use.
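
The splitting procedure described above can be sketched in Python. This is only an illustrative sketch, not the authors' implementation: the function name `refine`, the dictionary encoding of the two-argument transition function, and the toy automaton in the usage note are all assumptions made for the example. Because the loop visits every ordered pair of classes and tries each state both as left and as right argument, the two branches of the pseudocode collapse into one.

```python
def refine(states, delta, accepting):
    """Split {F, Q \\ F} until no class can be split any further.

    delta is a dict mapping a pair of states (q, r) to a state.
    Returns the final partition as a set of frozensets.
    """
    accepting = set(accepting)
    # Initial partition {F, Q \ F}; empty blocks are dropped.
    partition = [b for b in (accepting, set(states) - accepting) if b]

    def class_of(q):
        # Index of the block currently containing q.
        return next(i for i, block in enumerate(partition) if q in block)

    def find_split():
        for block_i in partition:
            for q in block_i:
                for block_j in partition:
                    # Use q as the left argument first, then as the right one.
                    for result in (lambda r: delta[q, r], lambda r: delta[r, q]):
                        groups = {}
                        for r in block_j:
                            groups.setdefault(class_of(result(r)), set()).add(r)
                        if len(groups) > 1:      # |K| > 1: block_j must be split
                            return block_j, list(groups.values())
        return None

    while (split := find_split()) is not None:
        old, pieces = split
        partition = [b for b in partition if b is not old] + pieces
    return {frozenset(b) for b in partition}
```

As a toy check (not taken from the paper), take the four states 0..3 with `delta[x, y] = (x + y) % 4`: with accepting set {0, 2} the partition stays at two classes, while with accepting set {0} every state ends up in its own class.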



Table 2. The function $\delta$ for the sample automaton for $\{wcw \mid w \in \{a, b\}^*\}$.

  | 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y
--+----------------------------------------------------------------------
0 | 0 2 0 0 2 2 0 E 0 0 0 2 2 D D D D D 0 0 K L K L 0 0 0 0 0 0 2 0 0 0 0
1 | 0 0 0 0 0 0 6 E R 0 0 0 0 D D D D D 6 6 0 0 0 0 R R Q Q R R 0 Q Q R R
2 | 0 0 0 0 0 0 1 E C 0 0 0 0 D D D D D 1 1 0 0 0 0 C C 4 4 C C 0 4 4 C C
3 | 3 9 3 3 9 9 0 G 0 3 3 9 9 F F F F F 0 0 M N M N 0 0 0 0 0 0 9 0 0 0 0
4 | 0 0 0 0 0 0 6 E R 0 0 0 0 D D D D D 6 6 0 0 0 0 R R Q Q R R 0 Q Q R R
5 | 0 0 0 0 0 0 X G O 0 0 0 0 F F F F F X X 0 0 0 0 O O X X Y Y 0 X X Y Y
6 | 0 0 0 0 0 0 6 E R 0 0 0 0 D D D D D 6 6 0 0 0 0 R R Q Q R R 0 Q Q R R
7 | 0 0 0 0 0 0 0 H 0 0 0 0 0 H H H H H 0 0 I P I P 0 0 0 0 0 0 0 0 0 0 0
8 | 0 0 0 0 0 0 6 E R 0 0 0 0 D D D D D 6 6 0 0 0 0 R R Q Q R R 0 Q Q R R
9 | 0 0 0 0 0 0 B G 5 0 0 0 0 F F F F F B B 0 0 0 0 5 5 B B U U 0 B B U U
A | A A A A A A A H A A A A A H H H H H A A 0 0 0 0
B | 0 0 0 0 0 0 V G J 0 0 0 0 F F F F F V V 0 0 0 0 J J V V W W 0 V V W W
C | 0 0 0 0 0 0 8 E T 0 0 0 0 D D D D D 8 8 0 0 0 0 T T S S T T 0 S S T T
D | 0 2 0 0 2 2 0 E 0 0 0 2 2 D D D D D 0 0 K L K L 0 0 0 0 0 0 2 0 0 0 0
E | 0 0 0 0 0 0 1 E C 0 0 0 0 D D D D D 1 1 0 0 0 0 C C 4 4 C C 0 4 4 C C
F | 3 9 3 3 9 9 0 G 0 3 3 9 9 F F F F F 0 0 M N M N 0 0 0 0 0 0 9 0 0 0 0
G | 0 0 0 0 0 0 B G 5 0 0 0 0 F F F F F B B 0 0 0 0 5 5 B B U U 0 B B U U
H | A A A A A A A H A A A A A H H H H H A A 0 0 0 0 A A A A A A A A A A A
I | 0 0 0 0 0 0 0 H 0 0 0 0 0 H H H H H 0 0 I P I P 0 0 0 0 0 0 0 0 0 0 0
J | 0 0 0 0 0 0 X G O 0 0 0 0 F F F F F X X 0 0 0 0 O O X X Y Y 0 X X Y Y
K | 0 2 0 0 2 2 0 E 0 0 0 2 2 D D D D D 0 0 K L K L 0 0 0 0 0 0 2 0 0 0 0
L | 0 2 0 0 2 2 0 E 0 0 0 2 2 D D D D D 0 0 K L K L 0 0 0 0 0 0 2 0 0 0 0
M | 3 9 3 3 9 9 0 G 0 3 3 9 9 F F F F F 0 0 M N M N 0 0 0 0 0 0 9 0 0 0 0
N | 3 9 3 3 9 9 0 G 0 3 3 9 9 F F F F F 0 0 M N M N 0 0 0 0 0 0 9 0 0 0 0
O | 0 0 0 0 0 0 X G O 0 0 0 0 F F F F F X X 0 0 0 0 O O X X Y Y 0 X X Y Y
P | 0 0 0 0 0 0 0 H 0 0 0 0 0 H H H H H 0 0 I P I P 0 0 0 0 0 0 0 0 0 0 0
Q | 0 0 0 0 0 0 6 E R 0 0 0 0 D D D D D 6 6 0 0 0 0 R R Q Q R R 0 Q Q R R
R | 0 0 0 0 0 0 8 E T 0 0 0 0 D D D D D 8 8 0 0 0 0 T T S S T T 0 S S T T
S | 0 0 0 0 0 0 6 E R 0 0 0 0 D D D D D 6 6 0 0 0 0 R R Q Q R R 0 Q Q R R
T | 0 0 0 0 0 0 8 E T 0 0 0 0 D D D D D 8 8 0 0 0 0 T T S S T T 0 S S T T
U | 0 0 0 0 0 0 X G O 0 0 0 0 F F F F F X X 0 0 0 0 O O X X Y Y 0 X X Y Y
V | 0 0 0 0 0 0 V G J 0
Abstract. A classical construction assigns to any language its (ordered)
syntactic monoid. Recently the author defined the so-called syntactic
semiring of a language. We show here that elements of the syntactic
semiring of L can be identified with transformations of a certain
modification of the minimal automaton for L.

The main issues here are the inequalities $r(x_1, \dots, x_m) \subseteq L$ and equations
$r(x_1, \dots, x_m) = L$, where L is a given regular language over a finite
alphabet A and r is a given regular expression over A in variables $x_1, \dots, x_m$.
We show that the search for maximal solutions can be translated into the
(finite) syntactic semiring of the language L. In this way we are able
to decide solvability and to find all maximal solutions effectively.

In fact, these questions were already solved by Conway using his
factors. The first advantage of our method is its complexity and the
second is that we calculate in a transparent algebraic structure.

Keywords: syntactic semiring, language equations 



1 Introduction 

The syntactic monoid is a monoid canonically attached to each language. In |1] 
the author introduced the so-called syntactic semiring of the language under 
the name syntactic semilattice-ordered monoid. The main result of that paper 
is an Eilenberg-type theorem giving a one-to-one correspondence between the 
so-called conjunctive varieties of regular languages and pseudovarieties of idem- 
potent semirings. The author’s next contribution |S] studies the relationships 
between the (ordered) syntactic monoid and the syntactic semiring of a given 
language. We mention here only that the first one is finite if and only if the 
second one is finite and that these two structures are equationally independent. 
Also several examples of conjunctive varieties of languages are presented there. 
In a certain sense our notion is a modification of Reutenauer's syntactic algebra
for the case of the Boolean semiring.

The study of the inequalities $r(x_1, \dots, x_m) \subseteq L$, where r is a regular
expression in variables $x_1, \dots, x_m$ (without constants) and L is a regular language, was
started by Conway in [2]. He shows that the components of maximal
solutions are among intersections of so-called factors (see Sect. 7). In particular, for
the equation $X \cdot Y = L$ he showed that for any languages P, Q such that $P \cdot Q = L$
there exist regular languages $P'$, $Q'$ such that $P \subseteq P'$, $Q \subseteq Q'$, $P' \cdot Q' = L$. This
result was reproved, among other things, by Kari [3]. Salomaa and Yu gave a
direct construction of the corresponding automata. Choffrut and Karhumaki [1]
generalized the result to a system of equations.

We consider here a single equation $r(x_1, \dots, x_m) = L$ where L is a regular
language over a finite alphabet A and r is a regular expression over A in
variables $x_1, \dots, x_m$. It is convenient to consider at the same time also the inequality
$r(x_1, \dots, x_m) \subseteq L$, since we show that its maximal solutions are in one-to-one
correspondence with the maximal solutions of this inequality shifted into the finite
syntactic semiring of the language L, and maximal solutions of $r(x_1, \dots, x_m) = L$
are some of the maximal solutions of $r(x_1, \dots, x_m) \subseteq L$. Components of maximal
solutions are shown to be among so-called L-closed languages, a finite family of
regular languages constructed from L. Our results lead to an effective procedure
for finding all maximal solutions of $r(x_1, \dots, x_m) = L$. In fact our translation
from languages to the syntactic semirings is an improvement of that from [1]
where power monoids of the syntactic monoids are used. In [1] they lose the
one-to-one correspondence mentioned above, and the syntactic semiring is often
much smaller than the power monoid of the syntactic monoid (we calculate a
concrete example where the syntactic monoid has 12 elements and the syntactic
semiring only 16 elements). Our methods also provide constructions of automata
which correspond to components of maximal solutions in terms of the minimal
automaton of the language L. We can also deal with systems of inequalities using
translations into products of syntactic semirings of languages occurring on the
right-hand sides.

In the next section we recall the notion of syntactic semiring and introduce
the notion of L-closed languages. In Section 3 we show that for a solution of
$r(x_1, \dots, x_m) = L$ its L-closure (applied componentwise) is again a solution.
Section 4 is devoted to a transition of computations from languages to the
syntactic semiring of L. In Section 5 we prove the correctness of an algorithm for a
computation of the syntactic semiring, and in Section 6 we calculate concrete
examples.

As stated above, Conway [2] already presents a method for finding all
maximal solutions of $r(x_1, \dots, x_m) \subseteq L$. In our approach the number of candidates
for components of maximal solutions is exponentially smaller. Moreover, we use
transparent calculations in a finite structure whose elements are represented by
mappings. The details of the comparison are explained in the last section.

2 Syntactic Semiring and Closed Languages 

A structure $(O, \cdot, \le)$ is called an ordered monoid if $(O, \cdot)$ is a monoid with the
neutral element 1, $(O, \le)$ is an ordered set, and $a, b, c \in O$, $a \le b$ implies
both $ac \le bc$ and $ca \le cb$. Further, a structure $(S, \cdot, \vee)$ is called an idempotent
semiring if $(S, \cdot)$ is a monoid, $(S, \vee)$ is a semilattice, and $a, b, c \in S$ implies both
$a(b \vee c) = ab \vee ac$ and $(a \vee b)c = ac \vee bc$. The last structure becomes an ordered
monoid with respect to the relation $\le$ defined by $a \le b \iff a \vee b = b$, $a, b \in S$.

In a finite idempotent semiring $(S, \cdot, \vee)$ we can define a unary operation ${}^*$;
namely for $a \in S$, we have $a^* = 1 \vee a \vee a^2 \vee \dots \vee a^k$ where k is the smallest
integer such that $1 \vee a \vee a^2 \vee \dots \vee a^k = 1 \vee a \vee a^2 \vee \dots \vee a^{k+1}$.
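
The definition of the star can be tried out on a small concrete idempotent semiring. The sketch below is an illustration only: it uses binary relations on a finite set (product = relational composition, join = union, 1 = identity relation), which is one standard example of a finite idempotent semiring and is not taken from the paper; the names `compose` and `star` are likewise hypothetical.

```python
def compose(r, s):
    """Relational composition: (x, z) whenever x r y and y s z."""
    return frozenset((x, z) for (x, y) in r for (y2, z) in s if y == y2)


def star(a, one, mul, join):
    """Return 1 v a v a^2 v ... v a^k for the smallest k where the join stabilizes."""
    total = one      # 1 v a v ... v a^k accumulated so far
    power = one      # the current power a^k
    while True:
        power = mul(power, a)
        new_total = join(total, power)
        if new_total == total:     # adding a^(k+1) changed nothing
            return total
        total = new_total


if __name__ == "__main__":
    identity = frozenset((x, x) for x in range(3))
    a = frozenset({(0, 1), (1, 2)})
    # The star of a relation is its reflexive-transitive closure.
    print(sorted(star(a, identity, compose, frozenset.union)))
```

Stopping at the first k where the join no longer grows is safe: if $a^{k+1}$ is already below the accumulated join, then so is every higher power.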

A language L over A defines the so-called syntactic congruence $\approx_L$ on the free
monoid $(A^*, \cdot)$ over A by

$u \approx_L v$ if and only if $(\forall\, p, q \in A^*)\ (puq \in L \iff pvq \in L)$.

The factor-structure $(A^*, \cdot)/{\approx_L}$ is called the syntactic monoid of L and we
denote it by $(O(L), \cdot)$. It is ordered by

$v{\approx_L} \le u{\approx_L}$ if and only if $(\forall\, p, q \in A^*)\ (puq \in L \Rightarrow pvq \in L)$

and we speak about the ordered syntactic monoid. We also write $v \preceq_L u$ instead
of $v{\approx_L} \le u{\approx_L}$.

Let $F(A^*)$ be the set of all non-empty finite sets of words over A. Notice that
the algebra $(F(A^*), \cdot, \cup)$ is a free idempotent semiring over A. Now L defines its
congruence $\sim_L$ by

$\{u_1, \dots, u_k\} \sim_L \{v_1, \dots, v_l\}$ if and only if

$(\forall\, p, q \in A^*)\ (pu_1q, \dots, pu_kq \in L \iff pv_1q, \dots, pv_lq \in L)$.

The factor-structure is called the syntactic semiring of L; we denote it by
$(S(L), \cdot, \vee)$.

Sometimes it is useful to consider $F^0(A^*) = F(A^*) \cup \{\emptyset\}$ and to define $\sim_L$
on $(F^0(A^*), \cdot, \cup)$. The factor-structure in this case is denoted by $(S^0(L), \cdot, \vee)$.
We also write $\{v\} \preceq_L \{u_1, \dots, u_k\}$ if

$(\forall\, p, q \in A^*)\ (pu_1q, \dots, pu_kq \in L \Rightarrow pvq \in L)$.

In fact, it is equivalent to $\{u_1, \dots, u_k, v\} \sim_L \{u_1, \dots, u_k\}$, and $\{v\} \preceq_L \{u\}$ is the
same as $v \preceq_L u$ defined above.

A language Q over A is called L-saturated if it is a union of $\approx_L$-classes,
Q is said to be L-hereditary if $u \in Q$, $v \preceq_L u$ implies $v \in Q$, and Q is L-closed
if $u_1, \dots, u_k \in Q$, $\{v\} \preceq_L \{u_1, \dots, u_k\}$ implies $v \in Q$.

Clearly, for a regular language L over a finite alphabet A,

Q is L-closed $\Rightarrow$ Q is L-hereditary $\Rightarrow$ Q is L-saturated $\Rightarrow$ Q is regular

and the language L itself is L-closed.

For any language P over A there exists the smallest L-closed language Q
over A containing P; we speak about L-closure and we write $Q = P^L$. In fact

$P^L = \{ v \in A^* \mid \{v\} \preceq_L \{u_1, \dots, u_k\} \text{ for some } u_1, \dots, u_k \in P \}$.

Those familiar with Conway's techniques may realize that L-closed sets are
exactly intersections of factors - see also Section 7.

3 Inequalities and Equations

Let $A = \{a_1, \dots, a_n\}$ ($n \ge 1$) be a fixed finite alphabet, let $X = \{x_1, \dots, x_m\}$
($m \ge 0$) be a finite set of variables. We denote by Reg(A, X) the set of all terms
in the language of nullary operational symbols $0, a_1, \dots, a_n$, binary operational
symbols $\cdot$, $+$ and unary operational symbol ${}^*$ over the variables $x_1, \dots, x_m$. The
elements of Reg(A, X) are called regular expressions over A in X.

Let $\hat{r}$ be the realization of $r \in$ Reg(A, X) in the algebra

$\mathcal{L} = (2^{A^*}, \emptyset, \{a_1\}, \dots, \{a_n\}, \cdot, \cup, {}^*)$

of all subsets of $A^*$, where $\cdot$ is the catenation, $\cup$ is the set-theoretical union and
${}^*$ is the Kleene star. Notice that $\{1\} = \emptyset^*$.

For a given $r \in$ Reg(A, X) and a regular language L over A we will consider
the inequality

$r(x_1, \dots, x_m) \subseteq L$    (*)

and the equation

$r(x_1, \dots, x_m) = L$ .    (**)

We could write more formally $r(x_1, \dots, x_m) \le l$ and $r(x_1, \dots, x_m) = l$ where
$l \in$ Reg(A, $\emptyset$) is a regular expression defining the language L.

An m-tuple $(P_1, \dots, P_m)$ of languages over A is called a solution of the
inequality (*) in $\mathcal{L}$ if $\hat{r}(P_1, \dots, P_m) \subseteq L$ and similarly, it is a solution of (**) in $\mathcal{L}$
if $\hat{r}(P_1, \dots, P_m) = L$. A solution $(P_1, \dots, P_m)$ of (*) in $\mathcal{L}$ is maximal if for every
other solution $(Q_1, \dots, Q_m)$,

$P_1 \subseteq Q_1, \dots, P_m \subseteq Q_m$ implies $P_1 = Q_1, \dots, P_m = Q_m$ .

Similarly for the equation (**).

The following is a reformulation of Conway's Theorem VI.9.

Theorem 1. Let $(P_1, \dots, P_m)$ be a solution of the inequality $r(x_1, \dots, x_m) \subseteq L$
in $\mathcal{L}$. Then $(P_1^L, \dots, P_m^L)$ is again a solution of this inequality. Consequently, the
components of any maximal solution are L-closed; in particular, they are regular
languages.

Proof. In fact we will prove the following:

For any languages $U, P, Q_2, \dots, Q_m, V$ over A ($m \ge 1$) and
$r \in$ Reg(A, $\{x_1, \dots, x_m\}$), we have

$U \cdot \hat{r}(P, Q_2, \dots, Q_m) \cdot V \subseteq L$ implies $U \cdot \hat{r}(P^L, Q_2, \dots, Q_m) \cdot V \subseteq L$ .

To get the Theorem put $U = V = \{1\}$ and permute arguments in r.

We leave it to the reader to use induction with respect to the complexity of r,
that is, to consider consecutively the cases:

$r = 0, a_1, \dots, a_n$, $r = x_1$, $r = x_2, \dots, x_m$, $r = s + t$, $r = s \cdot t$, $r = s^*$.

□

4 From Languages to the Syntactic Semiring (and Back)

Theorem 2. Let L be a regular language over a finite alphabet $A = \{a_1, \dots, a_n\}$.
The mapping

$\phi : P \mapsto \bigvee \{ \{u\}{\sim_L} \mid u \in P \}$,  $P \subseteq A^*$

is a surjective homomorphism of the algebra $\mathcal{L} = (2^{A^*}, \emptyset, \{a_1\}, \dots, \{a_n\}, \cdot, \cup, {}^*)$
onto the algebra $\mathcal{S}^0 = (S^0(L), \emptyset{\sim_L}, \{a_1\}{\sim_L}, \dots, \{a_n\}{\sim_L}, \cdot, \vee, {}^*)$. In fact, for
every $r \in$ Reg(A, $\{x_1, \dots, x_m\}$), $P_1, \dots, P_m \subseteq A^*$, we have

$\phi(\hat{r}(P_1, \dots, P_m)) = \tilde{r}(\phi(P_1), \dots, \phi(P_m))$    (*)

where $\tilde{r}$ is the realization of r in $\mathcal{S}^0$.

Moreover, for any $P \subseteq A^*$, its closure $P^L$ is the largest language with its
$\phi$-image equal to $\phi(P)$. Consequently, $\phi$ is a bijection between the set L(A) of
all L-closed languages and the set $S^0(L)$. The inverse mapping is also isotone
and can be expressed as

$\psi : \{u_1, \dots, u_k\}{\sim_L} \mapsto \{ v \in A^* \mid \{v\} \preceq_L \{u_1, \dots, u_k\} \}$ .

Proof. Note that in the formula for $\phi$ the join is considered in a finite structure;
in fact we can take instead of P any finite set Q intersecting exactly the same
$\approx_L$-classes as P does.

Clearly $\phi(\{u_1, \dots, u_k\}) = \{u_1, \dots, u_k\}{\sim_L}$, which gives the surjectivity.

Now $\phi(\emptyset) = \bigvee \{ \{u\}{\sim_L} \mid u \in \emptyset \} = \emptyset{\sim_L}$, a convention.

Clearly, $\phi(\{a_i\}) = \bigvee \{ \{u\}{\sim_L} \mid u \in \{a_i\} \} = \{a_i\}{\sim_L}$, $i = 1, \dots, n$.

Further, $\phi(P \cdot Q) = \bigvee \{ \{u\}{\sim_L} \mid u \in P \cdot Q \} = \bigvee \{ \{v \cdot w\}{\sim_L} \mid v \in P, w \in Q \}$
$= \bigvee \{ \{v\}{\sim_L} \mid v \in P \} \cdot \bigvee \{ \{w\}{\sim_L} \mid w \in Q \}$ due to the remark opening this
proof.

We have also $\phi(P \cup Q) = \bigvee \{ \{u\}{\sim_L} \mid u \in P \cup Q \}$
$= \bigvee \{ \{u\}{\sim_L} \mid u \in P \} \vee \bigvee \{ \{u\}{\sim_L} \mid u \in Q \} = \phi(P) \vee \phi(Q)$.

Finally, $\phi(P^*) = \phi(\{1\} \cup P \cup P^2 \cup \dots) = \bigvee \{ \{u\}{\sim_L} \mid u \in \{1\} \cup P \cup P^2 \cup \dots \}$
$= \bigvee \{ \{u\}{\sim_L} \mid u \in \{1\} \cup P \cup \dots \cup P^k \}$ for some k, since this expression is nondecreasing
with respect to k and our calculations are in a finite structure,
$= \phi(\{1\}) \vee \phi(P) \vee \dots \vee \phi(P^k)$.

On the other hand, $(\phi(P))^* = \{1\}{\sim_L} \vee \phi(P) \vee (\phi(P))^2 \vee \dots$
$= \{1\}{\sim_L} \vee \phi(P) \vee \phi(P^2) \vee \dots$ by the item concerning the product
$= \{1\}{\sim_L} \vee \phi(P) \vee \dots \vee \phi(P^l)$ by the arguments as above; moreover, we can choose
$k = l$.

Basics of universal algebra yield the expression (*).

If $\{v\} \preceq_L \{u_1, \dots, u_k\}$, $u_1, \dots, u_k \in P$, then $\{v\}{\sim_L} \le \{u_1\}{\sim_L} \vee \dots \vee \{u_k\}{\sim_L}$
and thus $\phi(P^L) = \phi(P)$.

Conversely let $\phi(Q) \le \phi(P)$. Then, for all $v \in Q$, we have
$\{v\}{\sim_L} \le \{u_1\}{\sim_L} \vee \dots \vee \{u_k\}{\sim_L}$ for some $u_1, \dots, u_k \in P$ (finiteness of $S^0(L)$) and $v \in P^L$.

The statements about $\psi$ are obvious. □

For a regular language L over a finite alphabet A, we can translate the
inequality (*) and the equation (**) into the inequality

$r(x_1, \dots, x_m) \le \phi(L)$    (†)

and the equation

$r(x_1, \dots, x_m) = \phi(L)$ .    (††)

Theorem 3. (i) The inequality (*) is solvable in $\mathcal{L}$ if and only if $(\emptyset, \dots, \emptyset)$ is a
solution.

(ii) If $(P_1, \dots, P_m)$ solves (*) in $\mathcal{L}$, then $(\phi(P_1), \dots, \phi(P_m))$ solves (†) in $\mathcal{S}^0$.
The same is true for (**) and (††).

(iii) Conversely, if $(c_1, \dots, c_m)$ solves (†) in $\mathcal{S}^0$, then $(\psi(c_1), \dots, \psi(c_m))$
solves (*) in $\mathcal{L}$.

(iv) Maximal solutions of (*) in $\mathcal{L}$ are in a one-to-one correspondence with
maximal solutions of (†) in $\mathcal{S}^0$. Every solution of (*) in $\mathcal{L}$ is contained
(componentwise) in a maximal one.

(v) Every maximal solution of (**) in $\mathcal{L}$ is also a maximal solution of (*).
Every solution of (**) in $\mathcal{L}$ is contained in a maximal one.

Proof. The item (i) follows from the fact that $\hat{r}$ is isotone, i.e.

$P_1 \subseteq Q_1, \dots, P_m \subseteq Q_m$ implies $\hat{r}(P_1, \dots, P_m) \subseteq \hat{r}(Q_1, \dots, Q_m)$ .

(ii) The application of the (isotone) operator $\phi$ to $\hat{r}(P_1, \dots, P_m) \subseteq L$ yields
by Theorem 2 $\tilde{r}(\phi(P_1), \dots, \phi(P_m)) \le \phi(L)$, and the same for the equality.

(iii) Let $(c_1, \dots, c_m)$ be a solution of (†) in $\mathcal{S}^0$. Let $Q_i = \psi(c_i)$, $i = 1, \dots, m$.
Then $\phi(Q_i) = c_i$, $i = 1, \dots, m$, and $\tilde{r}(\phi(Q_1), \dots, \phi(Q_m)) \le \phi(L)$. Consequently
$\phi(\hat{r}(Q_1, \dots, Q_m)) \le \phi(L)$ and the application of $\psi$ yields $(\hat{r}(Q_1, \dots, Q_m))^L \subseteq L$
and thus $\hat{r}(Q_1, \dots, Q_m) \subseteq L$.

(iv) Let $(P_1, \dots, P_m)$ be a maximal solution of (*). Then $P_i = P_i^L$, $i =
1, \dots, m$, by Theorem 1. Let $(c_1, \dots, c_m)$ be a solution of (†) such that
$(\phi(P_1), \dots, \phi(P_m)) \le (c_1, \dots, c_m)$ (componentwise). Application of $\psi$ gives that
$(\psi(c_1), \dots, \psi(c_m))$ is a solution of (*) which contains $(P_1, \dots, P_m)$. Thus $P_i =
\psi(c_i)$, $\phi(P_i) = c_i$, $i = 1, \dots, m$.

Conversely, let $(c_1, \dots, c_m)$ be a maximal solution of (†). Let $(Q_1, \dots, Q_m)$
be a solution of (*) containing $(\psi(c_1), \dots, \psi(c_m))$. Application of $\phi$ gives that
$(\phi(Q_1), \dots, \phi(Q_m))$ is a solution of (†) containing $(c_1, \dots, c_m)$. Thus $c_i = \phi(Q_i)$,
$\psi(c_i) = \psi(\phi(Q_i)) \supseteq Q_i$, $i = 1, \dots, m$.

Finally, for a solution $(P_1, \dots, P_m)$ of (*), the solution $(\phi(P_1), \dots, \phi(P_m))$ of
(†) is contained in a maximal solution $(c_1, \dots, c_m)$ of (†) (finiteness of $S^0(L)$).
Now $(P_1, \dots, P_m)$ is contained in a maximal solution $(\psi(c_1), \dots, \psi(c_m))$ of (*).

(v) The first statement is clear. For a given solution $(P_1, \dots, P_m)$ of (**) take
a maximal solution of (*) which contains $(P_1, \dots, P_m)$.

□

5 A Construction of the Syntactic Semiring

We extend here a well-known construction of the syntactic monoid of a regular 
language to the case of the syntactic semiring. This algorithm was announced in 
13; here we prove its correctness. 

Let a regular language L over a finite alphabet A be given. Classically one
assigns to L its minimal automaton $\mathcal{A}$ using left quotients; namely
$D = \{ u^{-1}L \mid u \in A^* \}$ is the (finite) set of states,
$a \in A$ acts on $u^{-1}L$ by $(u^{-1}L) \cdot a = a^{-1}(u^{-1}L)$.

Denote by $[u] : d \mapsto d \cdot u$ the transformation of the set D induced by $u \in A^*$.
$d_0 = L$ is the initial state and $u^{-1}L$ is a final state if and only if $u \in L$.

Now $(O(L), \cdot)$ is the transition monoid of $\mathcal{A}$; more precisely:

$(O(L), \cdot)$ is isomorphic to $\{ [u] \mid u \in A^* \}$ with $[u] \cdot [v] = [u \cdot v]$ and the
isomorphism is given by $u{\approx_L} \mapsto [u]$.
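
The transition monoid can be enumerated mechanically by a breadth-first search over words. The following is a hedged sketch, not the paper's algorithm: the function name `transition_monoid`, the encoding of a transformation as a tuple of images, and the constants `STATES` and `STEP` are assumptions. The transition table is the one for $L = A^*abaA^*$ from the first example of Section 6, with states named 1, p, q, 0.

```python
from collections import deque

# Minimal automaton of L = A*abaA* (state names as in Example 1 of Section 6).
STATES = ("1", "p", "q", "0")
STEP = {("1", "a"): "p", ("p", "a"): "p", ("q", "a"): "0", ("0", "a"): "0",
        ("1", "b"): "1", ("p", "b"): "q", ("q", "b"): "1", ("0", "b"): "0"}


def transition_monoid(states, alphabet, step):
    """Map every transformation [u] to a shortest word u inducing it.

    A transformation is stored as the tuple of images of `states`.
    """
    identity = tuple(states)            # [1]: the empty word fixes every state
    found = {identity: ""}
    queue = deque([identity])
    while queue:
        f = queue.popleft()
        for letter in alphabet:
            g = tuple(step[s, letter] for s in f)   # [u a]: apply u, then a
            if g not in found:
                found[g] = found[f] + letter
                queue.append(g)
    return found


if __name__ == "__main__":
    for images, word in transition_monoid(STATES, "ab", STEP).items():
        print(word or "1", images)
```

For this automaton the search finds 12 transformations, the size of the syntactic monoid quoted in the Introduction.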

The states are ordered by the opposite inclusion:

$u^{-1}L \le v^{-1}L$ if and only if $u^{-1}L \supseteq v^{-1}L$ .

The order on $O(L)$ is given by

$[f] \le [g]$ if and only if for every $d \in D$ we have $d \cdot f \le d \cdot g$ .

Indeed, $f \preceq_L g$ iff $(\forall\, p, q \in A^*)(pgq \in L \Rightarrow pfq \in L)$, that is iff
$(\forall\, p, q \in A^*)(q \in (pg)^{-1}L \Rightarrow q \in (pf)^{-1}L)$, that is iff
$(\forall\, p \in A^*)((pg)^{-1}L \subseteq (pf)^{-1}L)$, that is iff $(\forall\, p \in A^*)((p^{-1}L) \cdot g \subseteq (p^{-1}L) \cdot f)$.
We extend the set of states to

$\overline{D} = \{ d_1 \cap \dots \cap d_m \mid m \in \mathbb{N},\ d_1, \dots, d_m \in D \}$ .

As above, $d \in \overline{D}$ is a final state if and only if $1 \in d$. The action of a letter $a \in A$
is now given by

$(d_1 \cap \dots \cap d_m) \cdot a = d_1 \cdot a \cap \dots \cap d_m \cdot a$ .

It can be extended to transitions induced by non-empty finite sets of words by
$d \cdot \{u_1, \dots, u_k\} = d \cdot u_1 \cap \dots \cap d \cdot u_k$ for $d \in \overline{D}$, $u_1, \dots, u_k \in A^*$.

Again, denote by $[u_1, \dots, u_k]$ the transformation of $\overline{D}$ given by the set
$\{u_1, \dots, u_k\}$.

We show that $(S(L), \cdot, \vee)$ is isomorphic to $\{ [u_1, \dots, u_k] \mid u_1, \dots, u_k \in A^* \}$
with the operations of composition and

$[u_1, \dots, u_k] \vee [v_1, \dots, v_l] = [u_1, \dots, u_k, v_1, \dots, v_l]$.

The isomorphism is given by $\{u_1, \dots, u_k\}{\sim_L} \mapsto [u_1, \dots, u_k]$.

We have to show that $\{u_1, \dots, u_k\} \sim_L \{v_1, \dots, v_l\}$ is equivalent to
$[u_1, \dots, u_k] = [v_1, \dots, v_l]$.

Indeed, $\{u_1, \dots, u_k\} \sim_L \{v_1, \dots, v_l\}$ if and only if
$(\forall\, p, q \in A^*)(pu_1q, \dots, pu_kq \in L \iff pv_1q, \dots, pv_lq \in L)$, that is iff
$(\forall\, p, q \in A^*)(q \in (pu_1)^{-1}L \cap \dots \cap (pu_k)^{-1}L \iff q \in (pv_1)^{-1}L \cap \dots \cap (pv_l)^{-1}L)$,
that is iff $(\forall\, p \in A^*)\ (pu_1)^{-1}L \cap \dots \cap (pu_k)^{-1}L = (pv_1)^{-1}L \cap \dots \cap (pv_l)^{-1}L$,
that is iff $(\forall\, p \in A^*)\ p^{-1}L \cdot u_1 \cap \dots \cap p^{-1}L \cdot u_k = p^{-1}L \cdot v_1 \cap \dots \cap p^{-1}L \cdot v_l$,
that is iff $(\forall\, p \in A^*)\ p^{-1}L \cdot \{u_1, \dots, u_k\} = p^{-1}L \cdot \{v_1, \dots, v_l\}$, that is iff
$(\forall\, d_1, \dots, d_m \in D)\ (d_1 \cap \dots \cap d_m) \cdot \{u_1, \dots, u_k\} = (d_1 \cap \dots \cap d_m) \cdot \{v_1, \dots, v_l\}$.

Further, $(d \cdot \{u_1, \dots, u_k\}) \cdot \{v_1, \dots, v_l\} = (d \cdot u_1 \cap \dots \cap d \cdot u_k) \cdot \{v_1, \dots, v_l\}$
$= \bigcap \{ d \cdot (u_i v_j) \mid i = 1, \dots, k,\ j = 1, \dots, l \} = d \cdot (\{u_1, \dots, u_k\} \cdot \{v_1, \dots, v_l\})$
and thus $[u_1, \dots, u_k] \cdot [v_1, \dots, v_l] = [\{u_1, \dots, u_k\} \cdot \{v_1, \dots, v_l\}]$.

To make the computation finite, one considers instead of words from $A^*$
their representatives in $O(L)$. Moreover, it suffices to take only the actions of
hereditary subsets of $(O(L), \le)$ (a subset H of an ordered set $(B, \le)$ is hereditary
if $b \in B$, $a \in H$, $b \le a$ implies $b \in H$).
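
The construction can be sketched in Python for the special case where the state set of the minimal automaton is already closed under intersection, as happens for $L = A^*abaA^*$ in the first example of Section 6. Everything below is an illustrative assumption rather than the paper's code: the names, the tuple encoding of transformations, and in particular the hand-coded `intersect` table. In the general case one would first have to build the extended state set of all intersections. The sketch closes the transition monoid under pointwise intersection instead of enumerating hereditary subsets, which yields the same set of transformations.

```python
from itertools import product

# Minimal automaton of L = A*abaA*; 1 = L and 0 = A* as in Example 1.
STATES = ("1", "p", "q", "0")
STEP = {("1", "a"): "p", ("p", "a"): "p", ("q", "a"): "0", ("0", "a"): "0",
        ("1", "b"): "1", ("p", "b"): "q", ("q", "b"): "1", ("0", "b"): "0"}


def intersect(x, y):
    """Intersection of two states viewed as languages (table for this example only)."""
    if x == y:
        return x
    if x == "0":          # 0 = A* is neutral for intersection
        return y
    if y == "0":
        return x
    return "1"            # 1 with p, 1 with q, and p with q all give 1 = L


def compose(f, g):
    """Semiring product: apply the transformation f first, then g."""
    position = {s: i for i, s in enumerate(STATES)}
    return tuple(g[position[s]] for s in f)


def syntactic_monoid():
    """All transformations induced by single words."""
    elements, frontier = {STATES}, [STATES]
    while frontier:
        f = frontier.pop()
        for letter in "ab":
            g = tuple(STEP[s, letter] for s in f)
            if g not in elements:
                elements.add(g)
                frontier.append(g)
    return elements


def syntactic_semiring():
    """Close the monoid under the join, i.e. pointwise intersection of images."""
    elements = set(syntactic_monoid())
    changed = True
    while changed:
        changed = False
        for f, g in product(list(elements), repeat=2):
            h = tuple(intersect(x, y) for x, y in zip(f, g))
            if h not in elements:
                elements.add(h)
                changed = True
    return elements


if __name__ == "__main__":
    print(len(syntactic_monoid()), len(syntactic_semiring()))
```

For this automaton the two sets have 12 and 16 elements, matching the counts quoted in the Introduction; `compose` can then be used to test candidate solutions such as those with `compose(x, x)` equal to the constant-0 transformation.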



6 Examples 

1. Let $A = \{a, b\}$ and consider the language $L = A^*abaA^*$.

Denoting $1 = L$, $p = L \cup baA^*$, $q = L \cup aA^*$, $0 = A^*$, we see that the
minimal automaton of L looks as follows.



The order on states is given by $0 < p < 1$, $0 < q < 1$, with p, q incomparable.
Note that $p \cap q = 1$.

Now we calculate the transitions given by $1, a, b, a^2, ab, ba, b^2, a^3, \dots$ and
regular expressions for the corresponding $\approx_L$-classes.

u      | 1 p q 0 | $u{\approx_L}$
-------+---------+------------------------------------------
1      | 1 p q 0 | $1$
a      | p p 0 0 | $a^+b^2(b + a^+b^2)^*a^+ + a^+$
b      | 1 q 1 0 | $b$
ab     | q q 0 0 | $a^+b^2(b + a^+b^2)^*a^+b + a^+b$
ba     | p 0 p 0 | $ba^+b^2(b + a^+b^2)^*a^+ + ba^+$
b^2    | 1 1 1 0 | $b^2(b + a^+b^2)^*$
aba    | 0 0 0 0 | $A^*abaA^*$
ab^2   | 1 1 0 0 | $a^+b^2(b + a^+b^2)^*$
bab    | q 0 q 0 | $ba^+b^2(b + a^+b^2)^*a^+b + ba^+b$
b^2a   | p p p 0 | $b^2(b + a^+b^2)^*a^+$
bab^2  | 1 0 1 0 | $ba^+b^2(b + a^+b^2)^*$
b^2ab  | q q q 0 | $b^2(b + a^+b^2)^*a^+b$

The syntactic monoid of L has the presentation

$\langle\, a, b \mid a^2 = a,\ b^3 = b^2,\ aba = 0,\ ab^2a = a,\ b^2ab^2 = b^2 \,\rangle$

and its order is given componentwise according to the second column.

Some hereditary subsets of $O(L)$ act on D in the same way as some elements
of $O(L)$: $[a, ba, aba] = [b^2a]$ (this means $\{b^2a\} \preceq_L \{a, ba, aba\}$); some subsets
give rise to new transformations: $[a, b^2ab] : (1, p, q, 0) \mapsto (1, 1, q, 0)$. This gives
the label

1 1 q 0
a, b^2ab

in the order reduct of the syntactic semiring of L depicted below.



Further, $S^0(L) = S(L)$ since $\emptyset \sim_L \{aba\}$, and $\phi(L) = \{aba\}{\sim_L}$.

(i) Consider the inequality $x^2 \le L$ and the equation $x^2 = L$. Write
$\{u_1, \dots, u_k\}$ instead of $\{u_1, \dots, u_k\}{\sim_L}$. The elements of $S^0(L)$ with $x^2 \le \phi(L)$
are exactly $\{aba\}$, $\{ab\}$, $\{ba\}$ (not, for instance, $\{ab, ba\}$ since $\{ab, ba\}^2 =
\{bab, a, aba\} = \{1\}$). The last two are maximal solutions.

The corresponding languages are

$\psi(\{ab\}) = ab{\approx_L} \cup aba{\approx_L} = a^+b^2(b + a^+b^2)^*a^+b + a^+b + A^*abaA^*$,

$\psi(\{ba\}) = ba{\approx_L} \cup aba{\approx_L} = ba^+b^2(b + a^+b^2)^*a^+ + ba^+ + A^*abaA^*$.

In all three cases we have $x^2 = \{aba\}$ in $\mathcal{S}^0$ but not $x^2 = L$ in $\mathcal{L}$.
Summarizing, $x^2 \subseteq L$ has two maximal solutions in $\mathcal{L}$ and $x^2 = L$ is not
solvable there.

(ii) Clearly $(A^*a, baA^*)$ solves $X \cdot Y = L$. We have $\{a, b\}^*\{a\} = \{b^2\}^*\{a\} =
\{b^2\}\{a\} = \{b^2a\}$, $\{b\}\{a\}\{a, b\}^* = \{bab^2\}$ and thus our solution is contained in
the maximal solution $(\psi(\{b^2a\}), \psi(\{bab^2\})) =$

$(b^2a{\approx_L} \cup a{\approx_L} \cup ba{\approx_L} \cup aba{\approx_L},\ bab^2{\approx_L} \cup ba{\approx_L} \cup bab{\approx_L} \cup aba{\approx_L})$.

Although our first example is quite non-trivial, it demonstrates only a special
case of our construction of the syntactic semiring. In the following example
the set D of all states of the minimal automaton is not closed with respect to
intersections.

2. Let $A = \{a, b\}$ and consider the language $L = (abA^*)'$ (the complement of
$abA^*$).

Denoting $p = L$, $q = (bA^*)'$, $0 = A^*$, $1 = \emptyset$, $r = p \cap q$ we see that the
minimal automaton of L and its extension look as follows (the covering relation
of the order of $\overline{D}$ is indicated by broken lines).





Calculations of the syntactic monoid and the syntactic semiring of L yield
the following mappings from D to $\overline{D}$.

$u_1, \dots, u_k$ | p q 0 1
----------+--------
1         | p q 0 1
a         | q 0 0 1
b         | 0 1 0 1
a^2       | 0 0 0 1
ab        | 1 0 0 1
1, a      | r q 0 1
1, b      | p 1 0 1
1, ab     | 1 q 0 1
a, b      | q 1 0 1
b, ab     | 1 1 0 1
1, a, b   | r 1 0 1

The syntactic monoid of L has the presentation

$\langle\, a, b \mid ba = b,\ b^2 = b,\ a^3 = a^2,\ a^2b = a^2 \,\rangle$

The order reduct of the syntactic semiring of L is depicted below.




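[Figure: order reduct of the syntactic semiring of L; the node labels that survive are "p 1 0 1 / 1, b" and "p q 0 1 / 1".]

Here $S^0(L) \ne S(L)$ since there is no $u \in A^*$ satisfying
$(\forall\, p, q \in A^*)(puq \in L)$. Consequently $(S^0(L), \le)$ is $(S(L), \le)$ with a new
bottom element which represents $\emptyset$.

Consider the inequality $x^* \le L$. The maximal elements of $(S^0(L), \le)$ with
the property $x^* \le \phi(L)$ are exactly $\{1, a\}$ and $\{1, b\}$. The corresponding
languages are $1 + a + a^2A^*$ and $1 + bA^* + a^2A^*$.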



7 A Comparison of Our Results with Those by Conway 

In [2] Conway studies the inequality (*) in case that $r \in$ Reg($\emptyset$, $\{x_1, \dots, x_m\}$).
Let the maximal solutions of $x_1x_2 \subseteq L$ be exactly the pairs
$(P_1, Q_1), \dots, (P_l, Q_l)$ and let $F_{ij}$ be the maximal solution of $P_i x Q_j \subseteq L$, $i, j =
1, \dots, l$. He calls the components of maximal solutions of $x_1 \dots x_k \subseteq L$ ($k \in \mathbb{N}$)
factors and their first (last) components left (right) factors. He shows that

(i) the maximal solutions of $PxQ \subseteq L$ ($P, Q \subseteq A^*$) are factors;

(ii) the left (right) factors are exactly $P_1, \dots, P_l$ ($Q_1, \dots, Q_l$) and the factors
are exactly the $F_{ij}$'s ($i, j = 1, \dots, l$);

(iii) the components of maximal solutions of (*) are among intersections of
factors.

In [2] one calculates the $Q_i$'s as intersections of left quotients of L (in our
notation $\{Q_1, \dots, Q_l\} = \overline{D}$), then one proceeds by calculating the $P_i$
and the $F_{ij}$ as $\bigcap_{p \in P_i} p^{-1}P_j$ ($i, j = 1, \dots, l$).

Let us compare the complexities of both approaches with respect to the
number n of states of the minimal automaton for L, which equals the number
of left quotients of L.

The number of right factors of L equals the number of intersections of left
quotients of L, which is at most $2^n$. Thus the number of factors is less than or equal to
$(2^n)^2 = 2^{2n}$ and Conway's number of candidates for components of maximal
solutions of (*), which are just intersections of factors, is at most $2^{2^{2n}}$.

On the contrary, our number of candidates for components of maximal
solutions of (*) equals the number of elements of the syntactic semiring of L,
which is at most the number of mappings from D to $\overline{D}$ (since any
transformation $[u_1, \dots, u_k]$ of $\overline{D}$ is fully determined by its action on D). The last number
is $(2^n)^n = 2^{n^2}$. Thus we are doing exponentially better.
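
To make the gap concrete, here is the arithmetic for $n = 4$, the number of states of the minimal automaton in the first example of Section 6. The choice of n is ours and purely illustrative; the bounds are worst-case estimates, not the actual counts (the syntactic semiring of that example has only 16 elements).

```latex
\begin{align*}
\text{right factors} &\le 2^{n} = 2^{4} = 16,\\
\text{factors} &\le (2^{n})^{2} = 2^{2n} = 2^{8} = 256,\\
\text{intersections of factors} &\le 2^{2^{2n}} = 2^{256},\\
\text{elements of the syntactic semiring} &\le (2^{n})^{n} = 2^{n^{2}} = 2^{16} = 65536.
\end{align*}
```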

The second advantage of our methods is that we use transparent calculations 
in the finite syntactic semiring of L whose elements are represented by mappings. 

Finally, it is worth noticing that every L-closed language, say Q, is an
intersection of factors. Indeed, $Q = \bigcap \{ p^{-1}Lr^{-1} \mid pQr \subseteq L,\ p, r \in A^* \}$ and any
$p^{-1}Lr^{-1}$ is a factor since it is the maximal solution of $pxr \subseteq L$.
Abstract. We describe a class of transitive semiautomata whose power
automata are reduced: any two reachable sets of states have distinct
behavior. These automata appear naturally in the study of one-dimensional
cellular automata.



1 Motivation 

The acceptance languages of transitive semiautomata enjoy a number of special
properties. Notably, they are factorial, extensible and transitive (FET):
$uv \in L \Rightarrow u, v \in L$; $u \in L \Rightarrow \exists\, a, b \in \Sigma\ (aub \in L)$; and $u, v \in L \Rightarrow \exists\, x\ (uxv \in L)$,
where $\Sigma$ denotes the underlying alphabet. For languages of this type there
is an alternative notion of minimal deterministic automaton, first introduced
by Fischer and discovered independently by Beauquier in the form of the
0-minimal ideal in the syntactic semigroup of L. A Fischer automaton is a
deterministic transitive semiautomaton. For each factorial, extensible and transitive
language there is a unique Fischer automaton that minimizes the number of
states. Thus, for any FET language L we can measure the state complexity in
two ways: as the size $\mu(L)$ of the standard minimal DFA for L, or as the size
$\mu_F(L)$ of the minimal Fischer automaton for L. The minimal Fischer
automaton naturally embeds into the ordinary minimal deterministic automaton, see
0. Thus, except in the trivial case where $L = \Sigma^*$, we have $\mu_F(L) \le \mu(L) - 1$
and one can compute the minimal Fischer automaton in linear time given the
standard minimal DFA. Note that $\mu_F(L)$ is arguably a better measure for the
complexity of L than $\mu(L) - 1$ since the minimal Fischer automaton can be quite
small even when the minimal DFA is large, see m-

However, there is a fundamental obstruction to computing $\mu(L)$ or $\mu_F(L)$
for a FET language L: these languages are often given as a nondeterministic
transitive semiautomaton A. If the semiautomaton has size n, the only a priori
bounds available are $\mu_F(L) \le \mu(L) - 1$ and $\mu(L) \le 2^n$ since the
accessible part pow(A) of the Rabin-Scott power automaton of A has size at
most $2^n$. We write $\pi(A)$ for the size of this automaton. One can construct
the minimal automaton in polynomial time from pow(A), but as shown in [11] it
is PSPACE-hard to determine whether $\pi(A)$ is less than a given bound. Hence
there is no feasible computational shortcut that would allow one to determine
the size of the power automaton without actually constructing the machine.
Moreover, [?] show that exponential blow-up, where $\pi(A)$ is equal to or
close to the upper bound $2^n$, occurs quite frequently.
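
The accessible part pow(A) is obtained by the usual subset construction. The
following Python sketch is only an illustration and is not taken from the
paper: the dictionary encoding of the transitions and the names
`power_automaton` and `pi` are assumptions made for this example. Since every
state of a semiautomaton is initial, the search starts from the full state
set Q.

```python
from collections import deque


def power_automaton(states, alphabet, delta):
    """Accessible part of the Rabin-Scott power automaton.

    delta maps (state, symbol) to a set of successor states; a missing key
    means that there is no transition. The search starts from the full
    state set because all states of a semiautomaton are initial.
    """
    start = frozenset(states)
    seen = {start}
    queue = deque([start])
    trans = {}
    while queue:
        P = queue.popleft()
        for s in alphabet:
            # Image of the set P under the symbol s.
            R = frozenset(q for p in P for q in delta.get((p, s), ()))
            trans[(P, s)] = R
            if R not in seen:
                seen.add(R)
                queue.append(R)
    return seen, trans


def pi(states, alphabet, delta):
    """Number of reachable subsets (the empty set counts if it is reached)."""
    return len(power_automaton(states, alphabet, delta)[0])


# Toy semiautomaton (hypothetical): 0 -a-> 0, 0 -a-> 1, 1 -b-> 0.
toy = {(0, "a"): {0, 1}, (1, "b"): {0}}
print(pi([0, 1], "ab", toy))  # 3 reachable sets: {0,1}, {0} and the empty set
```

The loop visits each reachable subset once, so its cost is proportional to
the size of pow(A) itself, which is exactly the quantity that can blow up.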

We are particularly interested in the languages that arise in the study of
one-dimensional cellular automata, see [14,7,2,12,6]. For our purposes here, a
cellular automaton can be represented as a local map $\rho : \Sigma^w \to \Sigma$
that extends naturally to a global map on bi-infinite words on the alphabet
$\Sigma$, usually referred to as configurations in this context. The cover of a
configuration is the set of all its finite factors. It is easy to see that
$\mathrm{cov}(\rho)$, the union of all covers of $\rho(X)$ where X ranges over
all configurations, is a regular language. Indeed, the natural semiautomaton
for this language is a de Bruijn automaton $B(\rho)$ whose state set is
$\Sigma^{w-1}$ and whose transitions are of the form $(ax, \rho(axb), xb)$ where
$a, b \in \Sigma$ and $x \in \Sigma^{w-2}$, see [8,10,4]. The same holds for the
cover languages $\mathrm{cov}(\rho^t)$ associated with the iterates of the
global map. The de Bruijn automaton here has size $k^{t(w-1)}$ where k denotes
the size of the alphabet $\Sigma$. Thus, even for binary alphabets the only
obvious upper bound is

$$\mu(\mathrm{cov}(\rho^t)) \le 2^{2^{t(w-1)}}.$$
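
As a concrete illustration of the de Bruijn construction, the following Python
sketch builds the automaton from a local map of width w. It is a minimal
sketch under stated assumptions: the function names are hypothetical, and the
use of Wolfram's rule numbering for binary maps of width 3 is an illustrative
choice that does not come from the text.

```python
from itertools import product


def de_bruijn(rho, w, sigma=(0, 1)):
    """States and transitions of the de Bruijn automaton of a local map.

    rho takes a tuple of w symbols and returns one symbol. States are the
    words of length w - 1; each word axb of length w contributes the
    transition (ax, rho(axb), xb).
    """
    states = list(product(sigma, repeat=w - 1))
    delta = {}
    for word in product(sigma, repeat=w):
        src, dst = word[:-1], word[1:]  # the states ax and xb
        delta.setdefault((src, rho(word)), set()).add(dst)
    return states, delta


def elementary_rule(number):
    """Binary local map of width 3 in Wolfram's numbering (assumed here)."""
    def rho(word):
        a, b, c = word
        return (number >> (4 * a + 2 * b + c)) & 1
    return rho


states, delta = de_bruijn(elementary_rule(90), 3)
# Rule 90 outputs the XOR of the two outer cells, so every state has exactly
# one successor per symbol and the automaton is deterministic.
print(len(states), all(len(t) == 1 for t in delta.values()))
```

The returned dictionary uses sets of successors, so it can describe the
nondeterministic automata that arise for most local maps.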

In the mid eighties, Wolfram performed extensive calculations in an effort to
understand the behavior of the sequences $\mu(\mathrm{cov}(\rho^t))$, see
[?,14,15]. As shown in [8] the doubly exponential upper bound can be reached
for any width w at time t = 1, though none of the iterates $\rho^t$, t > 1,
display full blow-up.

One can characterize the cellular automata $\rho$ for which full blow-up occurs
as follows. Define a 1-permutation automaton to be any transitive
semiautomaton that is obtained by changing the label of a single transition in
a permutation automaton. In a permutation automaton each symbol in $\Sigma$
induces a permutation of the state set; equivalently, the automaton is
deterministic, codeterministic and complete. When the selected transition is a
loop we refer to the automaton as a loop-1-permutation automaton. For a de
Bruijn automaton $B(\rho)$ full blow-up occurs only for 1-permutation automata.
For loop-1-permutation automata we have a particularly simple situation,
see [?].



Theorem 1. Let $\rho$ be a binary cellular automaton such that $A = B(\rho)$ is a
loop-1-permutation automaton of size $n = 2^{w-1}$. Then $\mu(A) = \pi(A) = 2^n$
and $\mu_F(A) = \mu(A) - 1$.

In this paper we extend this result in two ways. First, we show that full
blow-up occurs in all degenerate 1-permutation automata, see below for
definitions. Second, we demonstrate that in general for any 1-permutation
automaton A we have $\mu(A) = \pi(A)$, regardless of the actual size of the
power automaton. Hence, given the power automaton pow(A) we can compute
$\mu_F(A)$ in linear time. In the next section we briefly introduce some
terminology, but in general we refer the reader to the references for more
background information. Section 3 contains the main result, and in the last
section we comment on open problems.

2 One-Permutation Automata

Let $A = (Q, \Sigma, \cdot)$ be any automaton, p a state in A. We will often
abuse notation and write p rather than {p}. We denote by $[p]_A$ the behavior
of state p in A, i.e., the set of words accepted if p is chosen as initial
state. Likewise, for $P \subseteq Q$, $[P]_A$ denotes the set of words accepted
if P is chosen as set of initial states. Recall that an automaton synchronizes
(on state p) if there exists a word w such that $Q \cdot w = p$. A synchronizes
completely if it synchronizes on all its states. A non-empty set
$P \subseteq Q$ is rich in A if $\bigcap_{p \in P} [p]_A - [Q - P]_A \neq \emptyset$.
The automaton is rich if all its states are rich. The reversal $A^{op}$ of an
automaton is obtained by replacing all transitions $p \xrightarrow{a} q$ by
$q \xrightarrow{a} p$. Similarly, $x^{op}$ denotes the reversal of a word x.
Note that $A^{op}$ is a 1-permutation automaton whenever A is. Since
$x \in [q]_A$ if, and only if, $q \in Q \cdot x^{op}$ in $A^{op}$ we have the
following proposition.

Proposition 1. State p is rich in A if, and only if, $A^{op}$ synchronizes on p.
More generally, $P \subseteq Q$ is rich in A if, and only if, P is reachable in
$A^{op}$.

Hence, if $A^{op}$ synchronizes completely the full power automaton of A is
reduced, and its accessible part pow(A) is the minimal automaton, so
$\pi(A) = \mu(A)$.
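
Proposition 1 turns richness into a reachability question in the reversed
automaton, which is easy to explore mechanically. The Python sketch below is
an illustration only: the dictionary encoding and all function names are
assumptions of this example, not notation from the paper.

```python
from collections import deque


def reverse(delta):
    """Reverse every transition p -s-> q into q -s-> p."""
    rev = {}
    for (p, s), targets in delta.items():
        for q in targets:
            rev.setdefault((q, s), set()).add(p)
    return rev


def reachable_sets(states, alphabet, delta):
    """All sets Q . x reachable from the full state set."""
    start = frozenset(states)
    seen, queue = {start}, deque([start])
    while queue:
        P = queue.popleft()
        for s in alphabet:
            R = frozenset(q for p in P for q in delta.get((p, s), ()))
            if R not in seen:
                seen.add(R)
                queue.append(R)
    return seen


def rich_sets(states, alphabet, delta):
    """Non-empty sets that are rich in A, i.e. reachable in the reversal."""
    return {P for P in reachable_sets(states, alphabet, reverse(delta)) if P}


def reversal_synchronizes_completely(states, alphabet, delta):
    """True if every single state is rich in A (a sufficient condition for
    the full power automaton of A to be reduced)."""
    rich = rich_sets(states, alphabet, delta)
    return all(frozenset([q]) in rich for q in states)


# Toy semiautomaton (hypothetical): 0 -a-> 0, 0 -a-> 1, 1 -b-> 0.
toy = {(0, "a"): {0, 1}, (1, "b"): {0}}
print(reversal_synchronizes_completely([0, 1], "ab", toy))  # True
```

In the toy example the word b separates state 1 from state 0 and the word a
separates state 0 from state 1, which is what the reachability search finds.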

To simplify notation, let $A_0$ be a transitive permutation automaton over the
alphabet {a, b} and fix a transition $\tau = (\alpha, b, \beta)$ in $A_0$. We
denote by A the 1-permutation automaton obtained by flipping the label of that
transition to a. Thus, locally the 1-permutation automaton A has the following
structure.




Proposition 2. Let A be a 1-permutation automaton. Then A synchronizes on
$\alpha$, and $\beta$ is rich in A.

Proof. Suppose $P \subseteq Q$ has cardinality larger than 1. Define the
standard length-lex order on words to be the product order where words are
first compared by length, and then each group of words of the same length is
ordered lexicographically. Let x be length-lex minimal such that
$\alpha \in P \cdot x$. Then $|P \cdot xb| = |P| - 1$ and A synchronizes on
$\alpha$ by induction. Since $A^{op}$ is also a 1-permutation automaton it
follows by the same argument and Proposition 1 that $\beta$ is rich in A. □
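
The length-lex order and the search for a minimal witness x can be made
concrete as follows. This is a hedged sketch: the three-state automaton is a
hypothetical example (a maps i to i+1, b maps i to i+2 modulo 3, and the
b-transition leaving state 0 has been relabelled a, so that state 0 plays the
role of $\alpha$), and the function names are assumptions.

```python
from itertools import count, product


def length_lex_words(alphabet):
    """Yield all words over the alphabet, shorter words first and words of
    equal length in lexicographic order."""
    for n in count(0):
        for letters in product(sorted(alphabet), repeat=n):
            yield "".join(letters)


def run(P, word, delta):
    """Image of the state set P under the word."""
    P = set(P)
    for s in word:
        P = {q for p in P for q in delta.get((p, s), ())}
    return P


def minimal_witness(P, target, alphabet, delta, max_len=12):
    """Length-lex minimal word x with target in P . x (None if no such word
    of length at most max_len exists)."""
    for x in length_lex_words(alphabet):
        if len(x) > max_len:
            return None
        if target in run(P, x, delta):
            return x


# Hypothetical 1-permutation automaton on three states; state 0 has no
# outgoing b-transition after the relabelling.
delta = {(0, "a"): {1, 2}, (1, "a"): {2}, (2, "a"): {0},
         (1, "b"): {0}, (2, "b"): {1}}
x = minimal_witness({1, 2}, 0, "ab", delta)
print(x, run({1, 2}, x + "b", delta))  # reading b after x drops one state
```

For the set {1, 2} the witness is the single letter a, and reading b
afterwards leaves one state, so the set has shrunk by one element.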

Proposition 1 allows us to demonstrate blow-up of machine A by arguing about
the richness of sets of states in $A^{op}$. To this end it is convenient to
think of computations as moving pebbles according to some input sequence. For
example, to show that p is rich we can place a red pebble on p and a black
pebble on each state in $Q - p$. We then have to remove all black pebbles (by
moving them to $\alpha$ and then firing a b transition), without losing the red
one. A loss could occur because the red pebble is located at $\alpha$ and the
next symbol is b, or because the red pebble moves to $\beta$, but a black
pebble arrives there at the same time. Of course, both red and black pebbles
will split when located at $\alpha$ and the next symbol is an a, so there may
be several red pebbles after a while.

To demonstrate this approach, consider an automaton A defined on a circulant
graph C(n; 1, d) where 1 < d < n. Figure 1 shows a 1-permutation automaton
based on C(6; 1, 2).




Fig. 1. A 1-permutation automaton on the circulant graph C(6; 1, 2).

In this case, A and $A^{op}$ are isomorphic. To see that an arbitrary non-empty
set $P \subseteq Q$ is rich place red pebbles on all the elements of P, and
black pebbles on all the states in $Q - P$. We can now fire a sequence of input
symbols a until a black pebble is moved to $\alpha$, and then fire b. Note that
no red pebbles are lost, even if the second incarnation of a pebble being
placed onto $\beta$ is eliminated by a black pebble arriving there at the same
time. Hence, we can remove all black pebbles. It follows that $\pi(A) = 2^n$
and, as we will see shortly, $\mu(A) = 2^n$. However, the size of the minimal
Fischer automaton is $m(2^{n/m} - 1)$ where $m = \gcd(n, d-1)$: the states in
$P \subseteq Q$ reachable from a single state p all have distances a multiple
of $d - 1$, and all such P are indeed reachable.
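
A small experiment along these lines can be run with the sketch below. The
labelling is an assumption made for illustration and need not coincide with
the one in Figure 1: a maps state i to i+1, b maps i to i+d (modulo n), and
the b-transition leaving state 0 is relabelled a. The function names are
hypothetical as well.

```python
from collections import deque


def circulant_one_permutation(n, d):
    """Assumed construction on the circulant graph C(n; 1, d): a-edges
    i -> i+1, b-edges i -> i+d, with the b-edge leaving 0 relabelled a."""
    delta = {}
    for i in range(n):
        delta[(i, "a")] = {(i + 1) % n}
        delta[(i, "b")] = {(i + d) % n}
    flipped = delta.pop((0, "b"))      # remove the transition (0, b, d)
    delta[(0, "a")] |= flipped         # and add it back with label a
    return delta


def pi(n, delta, alphabet="ab"):
    """Size of the accessible part of the power automaton."""
    start = frozenset(range(n))
    seen, queue = {start}, deque([start])
    while queue:
        P = queue.popleft()
        for s in alphabet:
            R = frozenset(q for p in P for q in delta.get((p, s), ()))
            if R not in seen:
                seen.add(R)
                queue.append(R)
    return len(seen)


print(pi(3, circulant_one_permutation(3, 2)))  # 8: all subsets of 3 states
print(pi(6, circulant_one_permutation(6, 2)))  # compare with 2**6
```

For n = 3 and d = 2 every subset of the state set is reachable, which can be
checked by hand; larger instances are best explored by running the code.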

As a first step towards our main result, let us first dispense with the case
when two of the states $\alpha$, $\beta$, $\delta$ or $\gamma$ coincide. Let us
call a 1-permutation automaton degenerate if either $\alpha = \beta$ or
$\delta = \gamma$. Thus loop-1-permutation automata in particular are
degenerate, and degenerate 1-permutation automata locally have the following
structure.




Theorem 2. Let A be a degenerate 1-permutation automaton of size n. Then
$\mu(A) = \pi(A) = 2^n$.

Proof. By Theorem 1 we only have to deal with the case $\delta = \gamma$. In
Theorem 3 we will show that $\mu(A) = \pi(A)$, so it suffices to show that in
$A^{op}$ every subset of Q is rich. But $A^{op}$ is again degenerate, so to
simplify notation we will show that every subset of Q is rich in
$A = A^{op\,op}$. To this end we show that for every R disjoint from P we have
$\bigcap_{r \in R} [r]_A - [P]_A \neq \emptyset$.

Let $\Lambda$ be the a-labeled cycle in $A_0$ that contains the consecutive
edges $\alpha \xrightarrow{a} \delta = \gamma \xrightarrow{a} \beta$. Note that
in A there is a chord $\alpha \xrightarrow{a} \beta$ in this cycle. Define a
partial order on the power set of Q by $P \preceq P'$ if $|P| < |P'|$, or if
$|P| = |P'|$ and the length-lex minimal word x such that $\alpha \in P \cdot x$
precedes the corresponding word for $P'$, again in length-lex order.

Suppose P is $\preceq$-minimal such that for some R with $R \cap P = \emptyset$
we have $\bigcap_{r \in R} [r]_A \subseteq [P]_A$. Place red pebbles on R and
black pebbles on P accordingly. By the minimality of P we must have
$\alpha \notin P$. Let x be the minimal witness such that
$\alpha \in P \cdot x$. Then x = bu and there is a red pebble on $\alpha$,
otherwise there would be a violation of the minimality of P.

Let m be the least common multiple of the lengths of all a-labeled cycles
in $A_0$. Move the pebbles according to $a^m$. Since P is minimal, no black
pebble can be on $\Lambda$, so that all black pebbles return to their original
positions. The red pebble originally at $\alpha$ now has a clone at $\beta$.
Hence we can now use b to move the black pebbles closer to $\alpha$ without
destroying a red pebble (and all its descendants). Since
$|P| = |P \cdot a^m b|$ this contradicts our minimality assumption. □

Note that no claims are made about the size of the minimal Fischer automaton
in the last theorem. It is not true in general that $\mu_F(A) = 2^n - 1$,
though in the special case of de Bruijn automata this relation obtains: in this
case $\gamma = \delta = 0^{w-1}$ or $1^{w-1}$, so there is a loop at
$\gamma = \delta$ labeled b. From this it follows easily that A synchronizes on
$\beta$, thus Q is reachable from $\alpha$ in A.



3 Reduced Power Automata 

We will now show that the power automata arising from 1-permutation automata
are always reduced. $A^-$ denotes the reversible automaton obtained by removing
the transition $\tau$ from $A_0$.

Lemma 1. Let A be a 1-permutation automaton where both $(\alpha, a, \delta)$ and
$(\gamma, a, \beta)$ lie in the same strongly connected component of $A^-$.
Then A is rich.

Proof. First consider a cycle C of the form
$\beta \to \alpha \xrightarrow{a} \delta \to \gamma \xrightarrow{a} \beta$. We
claim that every point q on C is rich. To see this let $P \subseteq Q$ where
$q \notin P$, and place red and black pebbles on Q correspondingly. We may
safely assume that $q \neq \beta$ by Proposition 2.

If there is no black pebble on $\alpha$ fire symbol s to advance the pebble
from q one step towards $\beta$. If there is a black pebble on $\alpha$ pick
i > 0 such that either $q \cdot b^i = q$ or $q \cdot b^i = \alpha$ depending on
whether the red pebble is currently located on a b-cycle or on the open path
labeled b ending at $\alpha$. In both cases, we either remove at least one
black pebble or we bring the red pebble closer to $\beta$ on C. Our claim
follows by induction.

But then every state is rich: we can move a red pebble from anywhere to the 
cycle C. □ 

Accessibility plays no role in the last argument, so we have the following 
corollary. 

Corollary 1. Let A be as in the last lemma. Then the full power automaton of
A is reduced. In particular, $\mu(A) = \pi(A)$.

From now on assume that $(\alpha, a, \delta)$ and $(\gamma, a, \beta)$ lie in two
separate transitive subautomata $A_\delta$ and $A_\beta$ determined by the
strongly connected components of $\delta$ and $\beta$ in $A^-$. A typical
example of this situation is shown in Figure 2. The subautomaton $A_\delta$
here has states $\{\alpha, \delta\}$ and all other states lie in $A_\beta$.




Fig. 2. A 1-permutation automaton with $\delta$ covered by $\{\beta, \beta'\}$.

In this automaton the behavior of $\delta$ is strictly contained in the
behavior of $\{\beta, \beta'\}$, and the full power automaton fails to be
reduced. However, we will see that $\{\beta, \beta'\}$ is not reachable, and the
power automaton is still reduced. To see why, consider
$P_1 \neq P_2 \subseteq Q$ but $[P_1]_A = [P_2]_A$. Without loss of generality
assume $p \in P_1 - P_2$. Let x be length-lex minimal such that
$p \cdot x = \alpha$. Since $\beta$ is rich we must have
$\gamma \in P_2 \cdot x$: otherwise $\beta \in p \cdot xa$ but
$\beta \notin P_2 \cdot xa$. Moreover, $\delta \in p \cdot xa$ and we must have
$\beta \in P_2 \cdot xau$ for any u such that $\delta \cdot u = \delta$. This is
captured in the next definition. $P \subseteq Q$ covers $\delta$ if
$\delta \notin P$ but $[\delta]_A \subseteq [P]_A$. A cover P is minimal if it
is minimal with respect to cardinality.

Thus, covers form the essential obstruction to the full power automaton being 
reduced. We will now describe the structure of minimal covers in great detail 
so as to show that these sets cannot appear in the accessible part of the power 
automaton. 

Proposition 3. Let P be a minimal cover of $\delta$ and x in the behavior of
$\delta$. Then $|P \cdot x| = |P|$.

Proof. By minimality, the cardinality of $P \cdot x$ cannot decrease for any
$x \in [\delta]_A$. Assume for the sake of a contradiction that
$|P \cdot x| > |P|$. Let u be the minimal prefix of x such that
$\alpha \in P \cdot u$. Note that $\alpha \notin \delta \cdot u$ since no merge
can occur on $\alpha$. But then $ub \in [\delta]_A$ whereas
$|P \cdot ub| < |P|$, contradiction. □

Proposition 4. Let P be a minimal cover of $\delta$ and $\delta = \delta \cdot x$.
Then $P \cdot x$ is a minimal cover and x acts like a permutation on P.

Proof. We have $[\delta]_A = x^{-1}[\delta]_A \subseteq x^{-1}[P]_A$ and therefore
$[\delta]_A \subseteq [P \cdot x]_A$. The first claim follows from the last
proposition.

Since $\delta = \delta \cdot x^i$ for all $i \ge 0$ it follows that
$P \cdot x^i$ is a minimal cover for all i. But then
$P \cdot x^i = P \cdot x^{i+j}$ for some $i \ge 0$, $j > 0$. Hence the orbits
are labeled $x^j$ and our claim follows. □

Lemma 2. Let P be a minimal cover of $\delta$. Then

$$P = \{\, \beta \cdot x \mid \delta \cdot x = \delta \text{ in } A^- \,\}.$$

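Proof. Denote $P_0 = \{\, \beta \cdot x \mid \delta \cdot x = \delta \text{ in } A^- \,\}$
and pick $x \neq \varepsilon$ such that $\delta \cdot x = \delta$ in $A^-$.
Clearly x = ua and, by Proposition 4, x permutes P. But
$\delta \cdot u = \alpha$, so $\gamma \in P \cdot u$ and $\beta \in P \cdot x$.
It follows that $P_0 \subseteq P$ and it suffices to show that $P_0$ is a cover.
So suppose u is in the behavior of $\delta$. Then for some suitable suffix v we
have $\delta \in \delta \cdot (uv)^i$ for all i, and $P_0$ contains a cycle

$$p = p_0 \xrightarrow{uv} p_1 \xrightarrow{uv} p_2 \cdots p_{r-1} \xrightarrow{uv} p_r = p.$$

But then u is in the behavior of $P_0$. □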

Since both subautomata $A_\delta$ and $A_\beta$ are reversible we have a natural
partial action of the free group over {a, b} on their state sets, see [1] for
details. To avoid confusion with our $\pi$ function we write $F(A_\delta, \delta)$
for the fundamental group of $A_\delta$, and $F(A_\beta, P)$ for the words in the
free group over {a, b} that permute P.

Lemma 3. Let P be a minimal cover of $\delta$. Then $F(A_\delta, \delta) = F(A_\beta, P)$.

Proof. We first show $F(A_\delta, \delta) \subseteq F(A_\beta, P)$. We have
$\{a, b\}^* \cap F(A_\delta, \delta) \subseteq F(A_\beta, P)$ from Proposition 4.
For the sake of simplicity, we consider only the case where
$x \in F(A_\delta, \delta)$ contains only one inverse symbol $\bar{a}$, the
general case is entirely analogous.
So, assume x = uaw where u,w G {a, 6}*. Since As is strongly connected we can 
choose a word v G {a, b} such that 6 ■ uw = 6 ■ ud. But then u{waYwv permutes 
P for all z > 0. For any p G P set Pp = p ■ u, so that pp ■ wu = f{p) for some 
permutation /. wa permutes the set {pp \ p G P}. Hence, qp - a = qp ■ {waYw 
for some z > 0 and it follows that udv permutes P. 

For the opposite direction let x G F(Pl/3, P) and set q = <5 • a;. Since P G P ■ x 
we must have q = S ■ P tor some z > 0, so that € F(^a, S) C F(Pl/3, P). But 
then xb permutes P, whence z = 0 and we have x G F(Pl5, <5). □ 

Theorem 3. Let A be a 1-permutation automaton. Then the accessible part of
the power automaton is reduced. Hence $\mu(A) = \pi(A)$.

Proof. By Lemma 1 we may safely assume that $(\alpha, a, \delta)$ and
$(\gamma, a, \beta)$ do not lie in the same strongly connected component of
$A^-$.

First consider the case when $\delta$ has no cover, i.e., when $\delta$ is rich.
Since there is a path $p \to \delta$ from any state p other than $\beta$ that
avoids $\beta$ it follows that A is rich and we are done as in Corollary 1.

So suppose $P_1 \neq P_2 \subseteq Q$ but $[P_1]_A = [P_2]_A$. Without loss of
generality assume $p \in P_1 - P_2$. As mentioned previously, we must have
$p \cdot xa = \delta$ and $P_2 \cdot xa$ covers $\delta$. But covers are not
reachable. To see this, assume otherwise, say $P = Q \cdot x$ for some cover P.
Since $\delta \notin P$ and all states in $A_\delta^{op}$ are complete, x must be
of the form x = vbu where $\delta' \cdot bu = \delta$. But then
$u \in F(A_\delta)$, so $P = Q \cdot vb$ by Lemma 3, contradicting the fact that
$\beta$ is in P by Lemma 2.

Hence $P_2$ cannot be reachable either, and we are done. □

4 Open Problems 

As pointed out in the introduction, any cellular automaton whose associated
FET language has maximum complexity must be a 1-permutation automaton.
Some of these automata are accounted for by Theorem 2. However, computational
experiments suggest that exactly half of the 1-permutation automata associated
with binary cellular automata of width w demonstrate full blow-up:

Conjecture 1. There are 1-permutation automata de Bruijn automata 

A such that 

$$\mu(A) = \mu_F(A) + 1 = 2^{2^{w-1}}.$$

It seems that the size of the power automaton of a 1-permutation automaton
A is usually invariant under reversal, the automaton in Figure 2 being a case
in point. There are exceptions, but we are unable to characterize them at this
point.

Conjecture 2. In general, for a 1-permutation automaton A we have
$\pi(A) = \pi(A^{op})$.

Generalizing 1-permutation automata to k-permutation automata in the obvious
way one can see that the corresponding cellular automata tend to have
smaller $\mu$ and $\pi$ values with increasing k. We do not presently
understand the nature of this correlation.

Abstract. A right [left] locally testable language S is a language with
the property that for some nonnegative integer k two words u and v in
alphabet $\Sigma$ are equal in the semigroup if (1) the prefix and suffix of the
words of length k - 1 coincide, (2) the set of segments of length k of the
words as well as (3) the order of the first appearance of these segments in
prefixes [suffixes] coincide.

We present a necessary and sufficient condition for a graph [semigroup] to be
the transition graph [semigroup] of a deterministic finite automaton that
accepts a right [left] locally testable language, and a necessary and
sufficient condition for the transition graph of a deterministic finite
automaton with locally idempotent semigroup. Based on these conditions, we
introduce polynomial time algorithms for the right [left] local testability
problem for the transition semigroup and the transition graph of a
deterministic finite automaton. A polynomial time algorithm verifies whether
a transition graph belongs to an automaton with locally idempotent transition
semigroup.

Keywords: language, locally testable, deterministic finite automaton,
algorithm, semigroup, graph



1 Introduction 

The concept of local testability was first introduced by McNaughton and Papert
[12] and by Brzozowski and Simon [?]. This concept is connected with languages,
finite automata and semigroups and has a wide spectrum of generalizations.

Necessary and sufficient conditions for local testability were investigated
for both the transition graph and the transition semigroup of the automaton
[?]. Polynomial time algorithms solve the problem of local testability for the
transition graph [?] and for the transition semigroup of the automaton [?].
They are polynomial in terms of the size of the semigroup or in the sum of
nodes and edges.

Right [left] local testability was introduced and studied by Konig [8] and by
Garcia and Ruiz [7]. These papers use different definitions of the concept and
we follow [7] here.

Theorem 1.1 [7] A finite semigroup S is right [left] locally testable iff it is
locally idempotent and locally satisfies the identity xyx = xy [xyx = yx].

For the notion of local idempotency see, for instance, [6]. The varieties of
semigroups defined by the considered identities are located not far from the
atoms in the structure of idempotent varieties [4].

We present in this work a necessary and sufficient condition for right [left]
local testability of the transition graph of a DFA and for the local
idempotency of the transition semigroup in terms of the corresponding
transition graph. We improve the necessary and sufficient condition for right
[left] local testability from [?] for the transition semigroup. On the basis
of these results, we introduce a polynomial time algorithm for the right
[left] local testability problem for the transition semigroup and the
transition graph of a deterministic finite automaton and for checking the
transition graph of an automaton with locally idempotent semigroup.

These algorithms are implemented in the package TESTAS. The package also
checks whether or not a language given by its minimal automaton or by the
syntactic semigroup of the automaton is locally testable, threshold locally
testable, strictly locally testable, or piecewise testable [15].



2 Notation and Definitions 

Let $\Sigma$ be an alphabet and let $\Sigma^+$ denote the free semigroup on
$\Sigma$. If $w \in \Sigma^+$, let $|w|$ denote the length of w. Let k be a
positive integer. Let $i_k(w)$ [$t_k(w)$] denote the prefix [suffix] of w of
length k or w if $|w| < k$. Let $F_k(w)$ denote the set of segments of w of
length k. A language L [a semigroup S] is called right [left] k-testable if
there is an alphabet $\Sigma$ [and a surjective morphism
$\varphi : \Sigma^+ \to S$] such that for all $u, v \in \Sigma^+$, if
$i_{k-1}(u) = i_{k-1}(v)$, $t_{k-1}(u) = t_{k-1}(v)$, $F_k(u) = F_k(v)$ and the
order of appearance of these segments in prefixes [suffixes] in the word
coincide, then either both u and v are in L or neither is in L
[$u\varphi = v\varphi$].
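
The data compared in this definition can be collected into a single
"signature" of a word. The Python sketch below is an illustration under
stated assumptions: the function names are hypothetical, and the reading of
"order of appearance in prefixes [suffixes]" as a left-to-right
[right-to-left] scan for first occurrences is an interpretation of the
definition, not a quotation.

```python
def segments(w, k):
    """All factors of w of length k, in order of position."""
    return [w[i:i + k] for i in range(len(w) - k + 1)]


def first_appearance_order(items):
    """Distinct items in the order in which they first occur."""
    seen, order = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            order.append(x)
    return tuple(order)


def signature(w, k, side="right"):
    """Prefix and suffix of length k - 1 (or w itself if w is shorter) and
    the segments of length k in order of first appearance.

    side="right" scans from the left end (prefixes), side="left" scans from
    the right end (suffixes). The set F_k(w) is the set of entries of the
    returned tuple.
    """
    prefix = w[:k - 1]
    suffix = w[max(0, len(w) - (k - 1)):]
    segs = segments(w, k)
    if side == "left":
        segs = segs[::-1]
    return prefix, suffix, first_appearance_order(segs)


# Two words with the same signature are not distinguished by a right
# 2-testable language.
print(signature("abab", 2) == signature("ababab", 2))  # True
print(signature("aab", 2, "right"), signature("aab", 2, "left"))
```

Under this reading, a language is right [left] k-testable exactly when
membership depends only on the signature computed with the matching side.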

An automaton is right [left] k-testable if the automaton accepts a right [left]
k-testable language.

A language L [a semigroup S, an automaton A] is right [left] locally testable
if it is right [left] k-testable for some k.

$|S|$ is the number of elements of the set S.

A semigroup S is called semigroup of left [right] zeroes if S satisfies the 
identity xy = x [xy = y] . 

A semigroup S has a property p locally if for any idempotent e G S the 
subsemigroup eSe has the property p. 

So a semigroup S is called locally idempotent if eSe is an idempotent sub- 
semigroup for any idempotent e G S. 
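
For small automata, local properties can be checked directly from these
definitions by brute force. The sketch below is not the polynomial time
algorithm of this paper; it enumerates the whole transition semigroup, which
may be exponentially large. The encoding of transformations as tuples and all
function names are assumptions of this example. Right and left local
testability are tested through the characterization by identities quoted from
[7] in the introduction.

```python
def compose(f, g):
    """Apply f first, then g (nodes are acted on from the right).
    Transformations are tuples over nodes 0..n-1; None means 'undefined'."""
    return tuple(None if x is None else g[x] for x in f)


def transition_semigroup(generators):
    """All transformations induced by non-empty words over the generators."""
    gens = [tuple(g) for g in generators]
    elems, frontier = set(gens), list(gens)
    while frontier:
        f = frontier.pop()
        for g in gens:
            h = compose(f, g)
            if h not in elems:
                elems.add(h)
                frontier.append(h)
    return elems


def local_subsemigroups(S):
    """Yield the subsemigroup eSe for every idempotent e of S."""
    for e in S:
        if compose(e, e) == e:
            yield {compose(compose(e, s), e) for s in S}


def is_locally_idempotent(S):
    return all(compose(x, x) == x
               for eSe in local_subsemigroups(S) for x in eSe)


def satisfies_locally(S, right):
    """Check xyx = xy (right=True) or xyx = yx (right=False) in every eSe."""
    for eSe in local_subsemigroups(S):
        for x in eSe:
            for y in eSe:
                xy, yx = compose(x, y), compose(y, x)
                if compose(xy, x) != (xy if right else yx):
                    return False
    return True


def is_right_locally_testable(S):
    return is_locally_idempotent(S) and satisfies_locally(S, right=True)


def is_left_locally_testable(S):
    return is_locally_idempotent(S) and satisfies_locally(S, right=False)


# Two nodes; letters act as the identity and as the two constant maps.
S = transition_semigroup([(0, 1), (0, 0), (1, 1)])
print(is_locally_idempotent(S), is_left_locally_testable(S),
      is_right_locally_testable(S))
```

In this example the two constant maps x and y satisfy xyx = yx but not
xyx = xy, so the semigroup passes the left test and fails the right one.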

A maximal strongly connected component of the graph will be denoted for
brevity as SCC, a finite deterministic automaton will be denoted as DFA. A
node from an SCC will be called for brevity an SCC-node. An SCC-node
can be defined as a node that has a right unit in the transition semigroup of
the automaton.

$|\Gamma|$ denotes the number of nodes of the graph $\Gamma$.
$\Gamma^n$ denotes the direct product of n copies of the graph $\Gamma$. The edge
$(p_1, \ldots, p_n) \to (q_1, \ldots, q_n)$ in $\Gamma^n$ is labelled by $\sigma$
iff for each i the edge $p_i \to q_i$ in $\Gamma$ is labelled by $\sigma$.

The graph with only trivial SCC (loops) will be called acyclic. 

If an edge $p \to q$ is labelled by $\sigma$ then let us denote the node q as
$p\sigma$.

We shall write $p \succeq q$ if the node q is reachable from the node p or
p = q ($p \succ q$ for distinct p, q).

In the case $p \succeq q$ and $q \succeq p$ we write $p \sim q$ (that is, p and q
belong to one SCC or p = q).
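
The product graph, the reachability relation and the notion of SCC-node can
be written down directly for a DFA given by its transition table. The
following Python sketch is illustrative only; the dictionary encoding and the
function names are assumptions, and an SCC-node is taken to be a node that
lies on some cycle (loops included).

```python
from itertools import product


def product_graph(states, alphabet, delta, n):
    """Edges of the n-fold direct product of the transition graph.

    delta maps (state, letter) to a state; a letter labels an edge of the
    product only if it is defined on every component.
    """
    edges = {}
    for node in product(states, repeat=n):
        for a in alphabet:
            if all((p, a) in delta for p in node):
                target = tuple(delta[(p, a)] for p in node)
                edges.setdefault(node, set()).add(target)
    return edges


def strictly_reachable(edges, source):
    """Nodes reachable from source by a path with at least one edge."""
    seen, stack = set(), list(edges.get(source, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(edges.get(v, ()))
    return seen


def succeq(edges, p, q):
    """The relation 'q is reachable from p or p = q'."""
    return p == q or q in strictly_reachable(edges, p)


def is_scc_node(edges, node):
    """A node is an SCC-node if it lies on a cycle."""
    return node in strictly_reachable(edges, node)


# One letter that swaps the two states.
swap = {(0, "a"): 1, (1, "a"): 0}
g2 = product_graph([0, 1], "a", swap, 2)
print(is_scc_node(g2, (0, 1)), succeq(g2, (0, 1), (1, 0)))  # True True
```

In the swap example the pair (0, 1) reaches (1, 0), which is the situation
excluded for automata with locally idempotent transition semigroup.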



3 Transition Graph of Deterministic Finite Automaton 

3.1 Graph of DFA with Locally Idempotent Transition Semigroup 

Lemma 3.1 Let S be the transition semigroup of a deterministic finite
automaton and let $\Gamma$ be its transition graph. Let us suppose that for three
distinct nodes p, q, r from $\Gamma$ the node (p, q, r) in $\Gamma^3$ is an
SCC-node, and $(p, q) \succeq (q, r)$ in $\Gamma^2$.

Then S is not locally idempotent.

Proof. Let us suppose that for the nodes p, q, r from $\Gamma$ the conditions of
the lemma hold. Therefore the nodes p, q, r have a right unit $e = e^2$, whence
pe = p, qe = q, re = r. In view of $(p, q) \succeq (q, r)$, there exists an
element $s \in S$ such that ps = q and qs = r. Therefore pese = q and
qese = r, whence $p(ese)^2 = r \neq q = pese$. So $p(ese)^2 \neq pese$ and
$(ese)^2 \neq ese$. The semigroup eSe is not an idempotent semigroup and
therefore S is not locally idempotent.

Lemma 3.2 Let S be the locally idempotent transition semigroup of a
deterministic finite automaton and let $\Gamma$ be its transition graph.

For any SCC-node $(p, q) \in \Gamma^2$ and $s \in S$, from $ps \succeq q$ follows
$qs \succeq q$.

Proof. Let us consider an SCC-node (p, q) from $\Gamma^2$ such that
$ps \succeq q$. The node (p, q) has a right unit $e = e^2$, so pe = p, qe = q.
For some $b \in S$ we have psb = q. We can assume s = es, b = be. We have
$esbe = (esbe)^2$ in the locally idempotent semigroup S. Therefore
$q = pesbe = p(esbe)^2 = qesbe = qsbe$. Thus we have $qs \succeq q$.

Lemma 3.2 implies

Corollary 3.3 Let S be the locally idempotent transition semigroup of a
deterministic finite automaton and let $\Gamma$ be its transition graph.

Let us suppose that in $\Gamma^2$ we have $(p, q) \succeq (q, r)$ and the node
(p, q) is an SCC-node. Then $r \sim q$.



Lemma 3.4 Let S be the transition semigroup of a deterministic finite automaton
and suppose that in $\Gamma^2$ we have $(p, q) \succeq (q, p)$ for two distinct
nodes p, q.

Then S is not locally idempotent.

Proof. We have ps = q and qs = p for some $s \in S$. So $ps^2 = p \neq ps = q$
and $p = ps^{2n} \neq ps^{2n-1} = q$. Therefore $s^{2n} \neq s^{2n-1}$ for any
integer n because of $p \neq q$. The finite semigroup S therefore contains a
non-trivial subgroup, whence S is not locally idempotent.

Let us formulate the necessary and sufficient conditions for a graph to be the
transition graph of a DFA with locally idempotent transition semigroup.

Theorem 3.5 The transition semigroup S of a deterministic finite automaton is
locally idempotent iff

1. $(p, q) \not\succeq (q, p)$ in $\Gamma^2$ for distinct nodes p, q,

2. for any SCC-node $(p, q) \in \Gamma^2$ and $s \in S$, from $ps \succeq q$
follows $qs \succeq q$, and

3. for any SCC-node (p, q, r) of $\Gamma^3$ with distinct components holds
$(p, q) \not\succeq (q, r)$ in $\Gamma^2$.

Proof. If S is locally idempotent then condition 1 follows from Lemma 3.4,
condition 2 follows from Lemma 3.2, and condition 3 follows from Lemma 3.1.

Suppose now that S is not locally idempotent. Then for some node p from
$\Gamma$, idempotent e and element s from S we have $p(ese)^2 \neq pese$. Hence
$pe \neq pese$ and at least one of the two nodes $p(ese)^2$, pese exists. If
the node $p(ese)^2$ exists then the node pese exists too. So pese exists
anyway. Therefore pe exists too and from $(pe, pese)ese = (pese, p(ese)^2)$ in
view of condition 2 follows $p(ese)^2 \succeq pese$, whence the node $p(ese)^2$
exists.

The node $(pe, pese, p(ese)^2)$ is an SCC-node of $\Gamma^3$ because all
components of the node have the common right unit e. Let us notice that
$p(ese)^2 \neq pese$ and $pe \neq pese$. We have
$(pe, pese) \succeq (pese, p(ese)^2)$. In the case $pe = p(ese)^2$ we have a
contradiction with condition 1, in the opposite case we have a contradiction
with condition 3.



3.2 Right Local Testability 

Theorem 3.6 Let S be the transition semigroup of a deterministic finite
automaton with state transition graph $\Gamma$. Then S is right locally testable
iff

1. for any SCC-node (p, q) from $\Gamma^2$ such that $p \sim q$ holds p = q,

2. for any SCC-node $(p, q) \in \Gamma^2$ and $s \in S$, from $ps \succeq q$
follows $qs \succeq q$.

Proof. Suppose the semigroup S is right locally testable.

Condition 1. Let (p, q) be an SCC-node with distinct components. Then for
some idempotent $e \in S$ holds (p, q)e = (p, q). If $p \sim q$ then for some
$a, b \in S$ holds qa = p and pb = q, whence qeae = p and pebe = q. So
qeaebe = q and pebeae = p. The semigroup S is right locally testable and
therefore the subsemigroup eSe satisfies the identity xyx = xy [7].
Consequently, q = qeaebe = qeaebeae = pebeae = p.

Condition 2 follows from Lemma 3.2 because a right locally testable semigroup
S is locally idempotent.

Suppose now that both conditions of the theorem are valid. Let us begin
with the local idempotency of S.

If the identity $x^2 = x$ is not valid in eSe for some idempotent e then for
some node $v \in \Gamma$ and some element $a \in S$ we have
$veae \neq veaeae$. At least one of the two considered nodes exists. In view of
$ve \succeq veae \succeq veaeae$ the nodes veae, ve exist. Let us denote
p = ve, q = veae. Therefore (p, q) is an SCC-node. Notice that
$peae \succeq q$. Hence, by condition 2, $qeae \succeq q$. Now, by condition 1,
in view of $q \succeq qeae$, we have qeae = q. So veae = veaeae in spite of our
assumption.

Thus the transition semigroup S is locally idempotent.

If the identity xyx = xy [2] is not valid in eSe then for some node
$v \in \Gamma$, some idempotent e and elements $a, b \in S$ holds
$veaebe \neq veaebeae$. So the node veaebe exists. Let us denote p = veaebe. S
is locally idempotent and therefore p = veaebeaebe. Consequently, the node
q = veaebeae exists too. We have $p \neq q$. The node
(veaebe, veaebeae) = (p, q) is an SCC-node from $\Gamma^2$. It is clear that
$p = veaebe \succeq veaebeae = q$. Then
$q = veaebeae \succeq veaebeaebe = veaebe = p$. So $p \sim q$ and $p \neq q$ in
spite of condition 1.



3.3 Left Local Testability 

Lemma 3.7 Let a reduced DFA A with state transition graph $\Gamma$ and transition
semigroup S be left locally testable. Suppose that for an SCC-node (p, q) of
$\Gamma^2$ holds $p \succeq q$.

Then for any $s \in S$ holds $ps \succeq q$ iff $qs \succeq q$.

Proof. Suppose A is left locally testable. Then the transition semigroup S of
the automaton is finite, aperiodic and for any idempotent $e \in S$ the
subsemigroup eSe is idempotent [7].

For some $a, e = e^2 \in S$ holds pa = q, (p, q)e = (p, q). So we have
pes = ps and qes = qs.

If we assume that $ps \succeq q$, then for some b from S holds psb = q, whence
pesbe = q. In the idempotent subsemigroup eSe we have $esbe = (esbe)^2$.
Therefore $qesbe = p(esbe)^2 = pesbe = q$ and $qes = qs \succeq q$.

If we assume now that $qs \succeq q$, then for some $d \in S$ holds qsde = q.
For some $a \in S$ holds pa = q because of $p \succeq q$. So
qsde = qesde = q and peaesde = q. The subsemigroup eSe satisfies the identity
xyx = yx, therefore eaesde = esdeaesde. So q = peaesde = pesdeaesde. Hence,
$pes = ps \succeq q$.



Lemma 3.8 Let a reduced DFA A with state transition graph $\Gamma$ be left
locally testable.

If the node (p, q, r) is an SCC-node of $\Gamma^3$, $(p, r) \succeq (q, r)$ and
$(p, q) \succeq (r, q)$ in $\Gamma^2$, then r = q.

Proof. Suppose A is left locally testable. Then the transition semigroup S of
the automaton is finite, aperiodic and for any idempotent $e \in S$ the
subsemigroup eSe is idempotent [7].

Let us consider the nodes p, q, r from $\Gamma$ such that the conditions of the
lemma are valid for them. From $(p, r) \succeq (q, r)$ and
$(p, q) \succeq (r, q)$ follows (p, r)s = (q, r) and (p, q)t = (r, q) for some
$s, t \in S$ and (p, q, r)e = (p, q, r) for some idempotent $e \in S$. We can
take s, t from eSe. Therefore

$$ese = s, \quad ete = t, \quad s^2 = s, \quad t^2 = t.$$

So ps = q, rs = r, pt = r, qt = q. Let us notice that $qs = ps^2 = ps = q$.
Analogously, rt = r.

We have psts = qts = qs = q. Then pts = rs = r. The identity xyx = yx is
valid in the subsemigroup eSe, whence q = psts = pts = r.

Let us formulate the necessary and sufficient conditions for a graph to be the
transition graph of a DFA with left locally testable transition semigroup.

Theorem 3.9 Let S be the transition semigroup of a deterministic finite
automaton with state transition graph $\Gamma$.

Then S is left locally testable iff

1. S is locally idempotent,

2. for any SCC-node (p, q) of $\Gamma^2$ such that $p \succeq q$ and for any
$s \in S$ we have $ps \succeq q$ iff $qs \succeq q$, and

3. if for arbitrary nodes $p, q, r \in \Gamma$ the node (p, q, r) is an SCC-node
of $\Gamma^3$, $(p, r) \succeq (q, r)$ and $(p, q) \succeq (r, q)$ in $\Gamma^2$,
then r = q.

Proof. Suppose the semigroup $S$ is left locally testable. Then $S$ is locally
idempotent [7]. The second and third conditions of the theorem follow from the
two preceding lemmas, respectively.

Suppose now that the conditions of the theorem are valid but for an arbitrary
node $p$, an arbitrary idempotent $e \in S$ and two elements $s, t \in eSe$ we have
$psts \neq pts$. By condition 1,

$$s^2 = s, \quad t^2 = t, \quad tsts = ts, \quad stst = st, \quad tssts = ts.$$

At least one of the two nodes $psts = q$ and $pts = r$ exists. Therefore $pe$ exists
too. We have $(pe, pts)sts = (psts, pts)$. Therefore the existence of the node
$pts = r$ implies by condition 2 the existence of the node $psts = q$. Analogously,
from $(pe, psts)ts = (pts, psts)$ and the existence of the node $psts = q$ it follows by
condition 2 that the node $pts = r$ exists.

The node $(pe, q, r)$ is an SCC-node because all its components have the common
right unit $e$. We have $(p, r)sts = (psts, ptssts) = (q, pts) = (q, r)$. Analogously,
$(p, q)ts = (pts, pststs) = (r, psts) = (r, q)$. Thus,

$$(pe, r) \succeq (q, r), \quad (pe, q) \succeq (r, q).$$

Now by the third condition of the theorem, $r = q$. Therefore $psts = pts$.
The node $p$ is an arbitrary node, whence $sts = ts$ for every two elements $s, t \in eSe$.
Consequently, the subsemigroup $eSe$ satisfies the identity $xyx = yx$. Thus the
semigroup $S$ is left locally testable.

4 Semigroups

Lemma 41 Let $S$ be a finite locally idempotent semigroup. The following two
conditions are equivalent in $S$:

a) $S$ satisfies locally the identity $xyx = xy$ ($S$ is right locally testable).

b) No two distinct idempotents $e, i$ from $S$ such that $ie = e$, $ei = i$ have a
common right unit in $S$. That is, there is no idempotent $f \in S$ such that $e = ef$
and $i = if$.

Proof. Suppose the identity $xy = xyx$ is valid in the subsemigroup $uSu$ for any
idempotent $u$, and for some idempotents $e, i$ in $S$ we have $ie = e$, $ei = i$. Suppose
$f$ is a common right unit of $e, i$. The identity $xyx = xy$ in $fSf$ and the equality
$ei = i$ imply $i = ei = efefif = efefifef = eie = e$. Thus the idempotents $e, i$
are not distinct.

Suppose now that $uSu$ does not satisfy the identity $xyx = xy$ for some
idempotent $u$. Notice that $uSu$ is an idempotent semigroup. So for some $a, b$ of
$S$, $uaubuau \neq uaubu$. For the two distinct idempotents $i = uaubuau$ and $e = uaubu$
with common right unit $u$ we have $ie = uaubuauuaubu = uaubuaubu = uaubu = e$
and $ei = uaubuaubuau = uaubuau = i$.

So two distinct idempotents $e, i$ from $S$ such that $ie = e$, $ei = i$ have a
common right unit $u$ in $S$.

The following lemma is proved analogously: 

Lemma 42 Let $S$ be a finite locally idempotent semigroup. The following two
conditions are equivalent in $S$:

a) $S$ satisfies locally the identity $xyx = yx$ ($S$ is left locally testable).

b) No two distinct idempotents $e, i$ from $S$ such that $ie = i$, $ei = e$ have a
common left unit in $S$. That is, there is no idempotent $f \in S$ such that $e = fe$
and $i = fi$.

Recall that a semigroup $A$ is a right [left] zero semigroup if $A$ satisfies the
identity $xy = y$ [$xy = x$]. A right [left] locally testable semigroup is locally
idempotent [3. Then from the last two lemmas follows 

Theorem 43 A finite semigroup S is right [left] locally testable iff S is locally 
idempotent and no two distinct idempotents e, i from right [left] zero subsemi- 
group have a common right [left] unit in S. 

5 An Algorithm for Semigroup 

The following proposition is useful for the algorithm. 

Lemma 51 \W] Let E be the set of idempotents of a semigroup S of size n 

represented as an ordered list. Then there exists an algorithm of order $n^2$ that
reorders the list so that the maximal left [right] zero subsemigroups of $S$ appear
consecutively in the list.

1. Testing whether a finite semigroup S is right [left] locally testable.

Suppose $|S| = k$. We begin by finding the set of idempotents $E$. This is a
linear time algorithm. Then let us verify local idempotency. For every $e \in E$ and
every $s \in S$ let us check the condition $ese = (ese)^2$. If the condition does not hold
for some pair, the semigroup is not locally idempotent and therefore not right
locally testable (Theorem 43). This takes $O(k^2)$ steps.

Now we reorder $E$ according to Lemma 51 in a chain such that the subsemigroups
of right [left] zeroes form intervals in this chain. We note the bounds of
these intervals. We find for each element $e$ of $E$ the first element $i$ in the chain
such that $e$ is a right [left] unit for $i$. Then we find in the chain the next element
$j$ with the same unit $e$. If $i$ and $j$ belong to the same subsemigroup of right [left]
zeroes we conclude that $S$ is not right [left] locally testable (Lemma 41 [42]) and
stop the process. If they are in different right [left] zero semigroups, we replace $i$
by $j$ and continue the process of finding a new $j$. This takes $O(k^2)$ steps.

Finding the maximal subsemigroup of right [left] zeroes containing a given
idempotent needs $k$ steps. So to reorder $E$ we need at most $k^2$ steps. The
time and space complexity of the algorithm is $O(k^2)$.
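As a rough illustration of the criterion behind this test, the following Python sketch checks right local testability of a finite semigroup given by its multiplication table. It is a direct brute-force check of local idempotency and of the condition on pairs of idempotents, not the chain-based procedure described above, so it does not achieve the same complexity. The function names and the table encoding (`table[x][y]` is the product of `x` and `y`, elements numbered from 0) are assumptions made for this example.

```python
def idempotents(table):
    """Return all elements e with e*e == e."""
    return [e for e in range(len(table)) if table[e][e] == e]


def is_locally_idempotent(table):
    """Check that ese is idempotent for every idempotent e and every s."""
    for e in idempotents(table):
        for s in range(len(table)):
            ese = table[table[e][s]][e]
            if table[ese][ese] != ese:
                return False
    return True


def is_right_locally_testable(table):
    """Brute-force check: locally idempotent, and no two distinct idempotents
    e, i with ie = e, ei = i share a common right unit f (e = ef, i = if)."""
    if not is_locally_idempotent(table):
        return False
    E = idempotents(table)
    for e in E:
        for i in E:
            if e != i and table[i][e] == e and table[e][i] == i:
                if any(table[e][f] == e and table[i][f] == i for f in E):
                    return False
    return True


# Two-element right zero semigroup (xy = y): no common right unit exists.
RIGHT_ZERO = [[0, 1], [0, 1]]
# The same semigroup with an identity element 2 adjoined: 2 is a common right unit.
RIGHT_ZERO_WITH_IDENTITY = [[0, 1, 0], [0, 1, 1], [0, 1, 2]]
# Cyclic group of order 2: not locally idempotent.
CYCLIC_GROUP = [[0, 1], [1, 0]]
```

The left-handed variant would swap the roles of the two products and look for a common left unit instead.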

6 Graph Algorithms 

Let n be the sum of the nodes and edges of E. The first-depth search m, HH] 
or ca) will be used for SCC search, for reachability table for triples and for 
checking condition 2 of theorems ESI and ES 

Table of reachability for triples. 

Suppose SCC of T, T^ and the table of reachability are known. For every 
S'CC'-node q of the graph T let us form by help of the first-depth search on 
the following relation L [I] on T: pLr if (p,q) ^ (r,q) [pdr if (p, q) ^ (q, r)]. 
For every node (p, q) we form set of nodes r such that pLr [p/r]. We use an 
auxiliary array for this aim: for every node (p, q) and for every node s, we form 
set of pointers to nearest successors (t,s) [(s,t)] of (p, q). 

If (p, q, r) is an S'CC-node with distinct components and pLr [p/r] then we 
add the triple (p, q, r) to the set Left [LocId\. (O(n^) time and space complex- 
ity)- 

6.1 Graph of Automaton with Locally Idempotent Transition 
Semigroup 

The algorithm is based on the theorem ESI Let us recognize the reachability 
on the graph E and form the table of reachability for all pairs of E. The time 
required for this step is 0{\E\‘^). 

We find graph T^ and all SCC of the graph {0{n^) time complexity). If the 
nodes (p,q) and (q, p) belong to common SCC then the transition semigroup 
is not locally idempotent (condition 1). 
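The check just described can be pictured with a small Python sketch. It explores the product graph of pairs of states and reports whether some pair $(p, q)$ with $p \neq q$ lies in a common strongly connected component with $(q, p)$. The encoding of the automaton (`delta[p][a]` is the successor of state `p` under letter number `a`, or `None` if undefined) and the function name are assumptions for illustration, and the search is a plain per-pair traversal rather than the tabulated procedure of the text. It uses the observation that if a word maps $(p, q)$ to $(q, p)$, then the same word maps $(q, p)$ back to $(p, q)$, so one direction of reachability suffices.

```python
def has_swapped_pair_in_common_scc(delta):
    """Return True if some pair (p, q), p != q, can reach (q, p) in the
    product graph of pairs of states."""
    n = len(delta)

    def reachable_from(source):
        seen = {source}
        stack = [source]
        while stack:
            p, q = stack.pop()
            for a in range(len(delta[p])):
                r, s = delta[p][a], delta[q][a]
                if r is None or s is None:
                    continue  # the pair has no successor under this letter
                if (r, s) not in seen:
                    seen.add((r, s))
                    stack.append((r, s))
        return seen

    for p in range(n):
        for q in range(p + 1, n):
            if (q, p) in reachable_from((p, q)):
                return True
    return False


# One letter swapping two states: (0, 1) and (1, 0) reach each other.
SWAP = [[1], [0]]
# One letter sending both states to state 1: no such pair exists.
COLLAPSE = [[1], [1]]
```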

To check condition 2 of the theorem, let us add to the graph $\Gamma^2$ a new
node $(0, 0)$ with edges from this node to every SCC-node $(p, q)$ of $\Gamma^2$ such
that $p \succeq q$. Let us consider depth-first search from the node $(0, 0)$ (the unique
starting point of any path).

Let us fix the node q after going through the edge (0, 0) — > (p, q). We do not 
visit edges (r,s) ^ (r,s)cr such that rcr ^ s. In the case that for the node (r,s) 
from two conditions rcr ^ q and scr ^ q only the first is valid the condition 2 does 
not hold, the transition semigroup is not locally idempotent and the algorithm 
stops. 

Let us find the graph $\Gamma^3$ and all SCCs of $\Gamma^3$, and mark all SCC-nodes
with three distinct components such that the first component is an ancestor of the two
others. (O(n^) time complexity). 

Let us go to condition 3 of the theorem. We form a table of triples
LocId (see the algorithm for the table of reachability above). If some SCC-node $(p, q, r)$
of $\Gamma^3$ with distinct components belongs to LocId then condition 3 does
not hold and the semigroup is not locally idempotent.

The whole time and space complexity of the algorithm is O(n^). 



6.2 Right Local Testability of DFA 

The algorithm is based on the theorem Let us form a table of reachability 
of the graph F, find all SCC of F, F^ and all S'C'C-nodes of F^. (0(n?) time 
complexity). 

Let us verify condition 1 of the theorem. For every SCC-node $(p, q)$
($p \neq q$) of $\Gamma^2$ let us check the condition $p \sim q$. If the condition holds the
automaton is not right locally testable. (O(n^) time complexity). 

To check condition 2 of the theorem, let us add to the graph $\Gamma^2$ a new
node $(0, 0)$ with edges from this node to every SCC-node $(p, q)$ of $\Gamma^2$ such
that $p \succeq q$. Let us consider depth-first search from the node $(0, 0)$ (the unique
starting point of any path).

Let us fix the node q after going through the edge (0, 0) (p, q). We do not 

visit edges (r,s) ^ (r,s)cr such that rtr ^ s. In the case that for the node (r,s) 
from two conditions rcr ^ q and scr ^ q only the first is valid the algorithm stops 
and the condition 2 does not hold. The automaton is not right locally testable 
in this case. (O(n^) time complexity). 

The whole time and space complexity of the algorithm is 0{n?). 



6.3 Left Local Testability of DFA 

The algorithm is based on the theorem EH Let us form a table of reachability on 
the graph F and find all SCC of F. Let us find F^ and all SCC of F^. (O(n^) 
time complexity). 

Let us check the local idempotency (O(n^) time complexity). 

To check condition 2 of the theorem, let us add to the graph $\Gamma^2$ a new
node $(0, 0)$ with edges from this node to every SCC-node $(p, q)$ of $\Gamma^2$ such
that $p \succeq q$. Let us consider depth-first search from the node $(0, 0)$.
We do not visit edges (r,s) ^ (r,s)cr such that rcr ^ s and scr ^ s. In the
case that for the node (r,s) from two conditions rtr ^ s and scr ^ s only one is 
valid the algorithm stops and the condition 2 does not hold. 

Condition 3 of the theorem. Let us find $\Gamma^3$ and all SCC-nodes of $\Gamma^3$
(O(n^) time complexity). 

Let us recognize the relation $\succeq$ on the graph $\Gamma^2$ and find the set Left of triples
$p, q, r$ such that $(p, q) \succeq (r, q)$ (see the algorithm for the table of reachability above).

If for some SCC-node $(p, u, v)$ of $\Gamma^3$ both triples $(p, u, v)$ and $(p, v, u)$
belong to the set then condition 3 does not hold, the automaton is not left
locally testable and the algorithm stops.

The whole time and space complexity of the algorithm is O(n^).

Abstract. Whale Calf is a parser generator that uses conjunctive grammars,
a generalization of context-free grammars with an explicit intersection
operation, as the formalism of specifying the language. All existing
parsing algorithms for conjunctive grammars are implemented - namely,
the tabular algorithm for grammars in the binary normal form, the tabular
algorithm for grammars in the linear normal form, the tabular algorithm
for arbitrary grammars, the conjunctive LL, the conjunctive LR
and the algorithm based on simulation of the automata equivalent to linear
conjunctive grammars. The generated C++ programs can determine
the membership of strings in the language and, if needed, create parse
trees of these strings.



1 Introduction 



Conjunctive grammars were introduced in [2] as a generalization of context-free
grammars that allows the use of an explicit operation of conjunction within the
formalism of rules, which has semantics of intersection of languages.

Conjunctive grammars inherit all descriptive capabilities of context-free
grammars (since context-free grammars can be viewed as a particular case of
conjunctive grammars), while the additional operation can be used anywhere
within the grammar. The rules in conjunctive grammars are of the form

$$A \to \alpha_1 \& \ldots \& \alpha_n, \qquad (1)$$

where $A$ is a nonterminal symbol, $n \geq 1$ and $\alpha_i$ are strings comprised of terminal
and nonterminal symbols. Informally, a rule (1) means that every string that can
be separately derived from each $\alpha_i$ can also be derived from $A$; the objects of the
form $A \to \alpha_i$ are called conjuncts. A formal definition of conjunctive grammars
can be found in the paper [7] published in this volume.

Besides the obvious increase in descriptional capabilities in comparison with
context-free grammars, it turns out that conjunctive grammars retain many
of their attractive properties, such as efficient recognition algorithms and tree
representation of derivations.

The papers [2,3,4,5,6,7] define several efficient recognition and parsing
algorithms for conjunctive grammars, which are briefly described in Section 2 of this
paper. Most of these algorithms generalize, in one sense or another, some known
context-free parsing methods; somewhat surprisingly, in almost every case the
computational complexity of a generalization is the same as the complexity of
its context-free prototype. This suggests that the new algorithms could be applied
to practical problems of syntax analysis, and motivates the development of
software implementing these algorithms.

The parser generator Whale Calf, which is discussed in this paper,
implements all the algorithms introduced in these papers.

2 Parsing Algorithms 

2.1 Algorithms for Grammars in Normal Forms 

Two of these were developed in the initial paper [2] to prove upper bounds for
the complexity of conjunctive and linear conjunctive languages. They are: (i)
the algorithm for grammars in the binary normal form and (ii) the algorithm
for grammars in the linear normal form.

It is known [2] that every conjunctive grammar can be transformed to the
binary normal form and every linear conjunctive grammar can be transformed
to the linear normal form. These two parsing algorithms are of purely
theoretical value and are supported in Whale Calf for the sake of completeness.
Both, similarly to the classical Cocke-Kasami-Younger context-free algorithm,
construct the $n \times n$ upper-triangular matrix $T_{ij} = \{A \mid A \in N, A \Rightarrow^* a_i \ldots a_j\}$
of nonterminals that derive the substrings of the input string $a_1 \ldots a_n$. The
construction starts from the elements $T_{11}, \ldots, T_{nn}$ and ends with $T_{1n}$; the string is
then accepted if and only if the start symbol of the grammar is in $T_{1n}$.

The binary normal form algorithm uses $O(n^3)$ time and $O(n^2)$ space, while
the linear normal form algorithm uses $O(n^2)$ time and $O(n)$ space.
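To make the matrix construction concrete, here is a minimal Python sketch of a tabular recognizer in the style described above. It assumes that rules in the binary normal form are either $A \to a$ or $A \to B_1 C_1 \& \ldots \& B_m C_m$, and it ignores any special rule for the empty string. The function name, the rule encoding and the sample grammar for $\{a^n b^n c^n \mid n \geq 1\}$ are illustrative assumptions, not part of Whale Calf.

```python
def recognize(word, start, terminal_rules, conjunctive_rules):
    """Fill T[i][j] with the nonterminals deriving word[i..j] (inclusive)."""
    n = len(word)
    if n == 0:
        return False  # a rule for the empty string is not modelled here
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        T[i][i] = {A for A, a in terminal_rules if a == symbol}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for A, conjuncts in conjunctive_rules:
                # Every conjunct B C must derive the substring, each with
                # its own split point k.
                if all(
                    any(B in T[i][k] and C in T[k + 1][j] for k in range(i, j))
                    for B, C in conjuncts
                ):
                    T[i][j].add(A)
    return start in T[0][n - 1]


# Hypothetical grammar: S -> X C & A Y, where X derives a^n b^n, C derives c+,
# A derives a+ and Y derives b^n c^n.
TERMINAL_RULES = [("Ta", "a"), ("Tb", "b"), ("Tc", "c"), ("A", "a"), ("C", "c")]
CONJUNCTIVE_RULES = [
    ("S", [("X", "C"), ("A", "Y")]),
    ("X", [("Ta", "Tb")]), ("X", [("Ta", "X1")]), ("X1", [("X", "Tb")]),
    ("Y", [("Tb", "Tc")]), ("Y", [("Tb", "Y1")]), ("Y1", [("Y", "Tc")]),
    ("A", [("Ta", "A")]), ("C", [("Tc", "C")]),
]
```

Because each conjunct splits the substring into two strictly shorter parts, a cell depends only on cells for shorter substrings, which is why filling the matrix by increasing length is sufficient in this sketch.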

2.2 Tabular Algorithm for Arbitrary Grammars 

This algorithm is based upon the general idea of the well-known context-free
parsing algorithm due to Graham, Harrison and Ruzzo, which, given an input
string $a_1 \ldots a_n$, constructs an upper-triangular matrix $\{t_{ij}\}$ of
sets of dotted rules of the form $A \to \alpha \cdot \beta$ (where $A \to \alpha\beta$ is a rule of the
grammar), such that $A \to \alpha \cdot \beta \in t_{ij}$ if and only if $\alpha$ derives the substring
$a_{i+1} \ldots a_j$ of the input string and at the same time $S \Rightarrow^* a_1 \ldots a_i A \gamma$ for some
string $\gamma$.

The generalization of the GHR algorithm for the case of conjunctive grammars
P] creates a similar matrix $\{t_{ij}\}$ of dotted conjuncts of the same form
$A \to \alpha \cdot \beta$, such that there exists a rule $A \to \gamma_1 \& \ldots \& \alpha\beta \& \ldots \& \gamma_m$. A dotted conjunct
$A \to \alpha \cdot \beta$ is in $t_{ij}$ if and only if $\alpha$ derives $a_{i+1} \ldots a_j$ and there exists a finite
sequence of conjuncts $C_{t-1} \to \gamma_t C_t \delta_t$ ($0 < t \leq k$, where $k \geq 0$ is some number),
such that $C_0 = S$, $C_k = A$ and there exists a factorization $a_1 \ldots a_i = u_1 \ldots u_k$,
where $\gamma_t$ derives $u_t$ for every $t$.

The algorithm is applicable to any conjunctive grammar, works in cubic time
and uses quadratic space. If the grammar is linear - i.e., every rule is either of
the form $A \to w$ ($w \in \Sigma^*$) or of the form $A \to u_1 B_1 v_1 \& \ldots \& u_n B_n v_n$ ($n \geq 1$,
$u_i, v_i \in \Sigma^*$, $B_i \in N$) - then the complexity is reduced to quadratic time and
$O(n)$ space.

However, in order to achieve the $O(n)$ space upper bound, one has to sacrifice
the possibility of parse tree construction. The implementation of the algorithm
in Whale Calf uses $O(n^2)$ time and space in the case of linear conjunctive
grammars, but always allows parse trees to be constructed.

Two methods of parse tree construction for this algorithm have been developed
in [1]. Using the bottom-up method, the tree is constructed along with the
creation of the recognition matrix, and the algorithm is slowed down by a constant
factor. The top-down method constructs the parse tree out of the matrix
$\{t_{ij}\}$ after the string is successfully recognized; this method works fast on
"reasonable" grammars, but can use exponential time or even go into an infinite loop
on "unreasonable" ones. Only the top-down method is implemented in the Whale
Calf parser generator.



2.3 Top-Down Conjunctive SLL(k) Algorithm

A context-free strong LL(k) top-down parser attempts to construct the leftmost 
derivation of an input string, using k lookahead symbols to determine the rules 
to apply to nonterminals, and a pushdown store to hold the right parts of the 
sentential forms forming the derivation. Left parts of sentential forms are prefixes 
of the input string that are being compared with the input symbols and then 
discarded. 

The conjunctive generalization of this algorithm [3] uses a tree-structured
pushdown to handle multiple branches of computation simultaneously, thus ensuring
that substrings of the input string are derived from every conjunct of a rule. The
parsing table used by this algorithm is a mapping from $N \times \Sigma^{\leq k}$ to the set of
rules of the grammar, where $\Sigma^{\leq k}$ denotes $\{w \mid w \in \Sigma^*, |w| \leq k\}$.

In its deterministic case the algorithm is applicable to a subclass of conjunctive
grammars. Although there exist grammars, even for the simplest languages,
on which the algorithm works in exponential time and uses exponential space,
its complexity is nevertheless linear for the practical cases, which include the
intersection closure of context-free LL(k) languages.

Similarly to context-free LL, the conjunctive LL parsing method can be im- 
plemented manually using a variation of the recursive descent technique; this 
kind of implementation is not supported in Whale Calf. 



2.4 Bottom-Up Conjunctive LR Algorithm 

The Generalized LR parsing algorithm for context-free grammars, introduced
by Tomita in 1986, is a polynomial-time implementation of nondeterministic
LR parsing that uses a graph-structured stack to represent the contents of the
nondeterministic parser's pushdown for all possible branches of computation at
a single computation step.

The same idea of graph-structured pushdown turns out to be suitable for
parsing conjunctive grammars. While generalized LR uses graph-structured
pushdown merely to simulate nondeterminism whenever it arises, the extension
of this algorithm for the conjunctive case additionally relies on doing several
computations at once in order to implement the conjunction operation. In order
to reduce by a rule, it requires multiple paths corresponding to the conjuncts
of the rule to be present in the graph at the same time. Instead of defining a
particular way of constructing a parsing table, the algorithm was proved correct
for any table that satisfies the requirements listed in |H], and an extension of
the context-free SLR(k) method conforming to these requirements was developed.
The latter method is used in the implementation in the Whale Calf parser generator.

Although internally the algorithm is somewhat different from the context-free 
generalized LR, it looks very much the same from the user’s side, and hence one 
could expect it to be as suitable for practical use as the context-free generalized 
LR has proved to be. 

The algorithm is applicable to any grammar and can be implemented to 
work in no more than 0{n^) time. In many common cases it is even faster: for 
instance, it is known to work in linear time for the Boolean closure of determin- 
istic context-free languages. The implementation used by Whale Calf has O(n^) 
complexity upper bound, but in practical cases it works better than the other 
implementation with worst case cubic time performance proposed in [Q. 

2.5 Automaton-Based Algorithm for Linear Conjunctive Grammars 

In the paper [S], a very simple family of cellular automata was found to be 
computationally equivalent to linear conjunctive grammars. Several practical 
algorithms for creating these automata and reducing their size were proposed in 
[7]. Once constructed out of grammars, these automata can then be simulated
in $(n^2 + n)/2 + C$ elementary table lookup operations (where $n$ is the length of
the input string), using space of no more than $n$.

Although the tabular algorithm for arbitrary grammars of jj] also works 
in $O(n^2)$ time and $O(n)$ space, the present algorithm is faster by a large constant
factor and requires much less memory.
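The lookup count quoted above can be illustrated with a small Python sketch. It assumes a triangular computation in which each input symbol is first mapped to a state, and each further cell is obtained from two neighbouring cells of the previous row by one table lookup; this model, the function name and the toy automaton (which accepts strings over {a, b} whose first and last symbols coincide) are assumptions for illustration and are not taken from Whale Calf. A single row is updated in place, so the space used is proportional to $n$.

```python
def run_triangular_automaton(word, init, delta, accepting):
    """Simulate the automaton on one row of cells updated in place.

    Returns (accepted, number_of_table_lookups)."""
    row = [init[symbol] for symbol in word]  # n lookups in the initial table
    lookups = len(row)
    while len(row) > 1:
        for i in range(len(row) - 1):
            # row[i + 1] still holds the value from the previous row here.
            row[i] = delta[(row[i], row[i + 1])]
            lookups += 1
        row.pop()
    return (bool(row) and row[0] in accepting), lookups


# Toy automaton: a state records the first and last symbol of a substring.
STATES = ["aa", "ab", "ba", "bb"]
INIT = {"a": "aa", "b": "bb"}
DELTA = {(x, y): x[0] + y[1] for x in STATES for y in STATES}
ACCEPTING = {"aa", "bb"}

accepted, lookups = run_triangular_automaton("abba", INIT, DELTA, ACCEPTING)
print(accepted, lookups)  # True 10, and 10 == (4*4 + 4) / 2
```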

3 Implementation 

Whale Calf software consists of two distinct components: the parser generator
that converts a text-based description of a conjunctive grammar into a set of
C++ structures containing information necessary to parse the language, and a
C++ class library that uses this information to do the actual parsing. All the
user has to do is to write one or more grammar files; use the Whale Calf parser
generator to translate them into C++ files, which will define a grammar object;
define a parser object within the main program, passing the grammar object to
its constructor; use a parser class method to load a string into the parser; and
finally invoke one more method to recognize or parse the string.

3.1 Input File Format

A Whale Calf input file consists of a declaration of the parsing method used,
of some parameters for this parsing method (where applicable), of the list of
terminal symbols and of the set of rules. The specification of the parsing
algorithm is of the form algorithm=value, where value must be one of the
following: tabular, SLL1, SLLk, LR, linear_automaton, binary, linear. The default
parsing algorithm is tabular.

If algorithm=SLLk, then the value of the parameter k can be set using the
statement of the form SLLk_lookahead_length=k; the default value is 2. The
size of the parsing table grows exponentially with k.

If algorithm=linear_automaton, then some parameters for the automaton
creation procedure can be configured through the variables
automaton_filtering_order and automaton_filtering_tuple_size
(these control state filtering; the default is order 2 filtering of pairs -
i.e., both equal 2), and automaton_accessing_string_max_length and
automaton_number_of_accessing_strings, that, once set, instruct Whale Calf
to compile the list of accessing strings of states at the stage of parser generation,
which might be useful for the user and proved to be useful for the author (see 
the table in 0 ). 

Consider the following input file for Whale Calf that denotes a grammar for
the language $\{wcw \mid w \in \{a, b\}^*\}$ [2] and instructs the parser generator to use
the LR algorithm:

algorithm=LR;
terminal a, b, c;

S -> C & D;
C -> a C a | a C b | b C a | b C b | c;
D -> a A & a D | b B & b D | c E;
A -> a A a | a A b | b A a | b A b | c E a;
B -> a B a | a B b | b B a | b B b | c E b;
E -> a E | b E | e;


Whale Calf will process this file and create two C++ source files with the
names of the form filename.cpp and filename.h. These two files, together
with the runtime library files, whalecalf_rtl.cpp and whalecalf_rtl.h, form
a parser for this language.



3.2 Use of Generated Parsers 

All six algorithms are implemented in a uniform manner; the difference
is mainly between the names of the parser classes. These names
are WhaleCalfTabularParser, WhaleCalfLL1Parser, WhaleCalfLLkParser,
WhaleCalfLRParser, WhaleCalfBinaryParser, WhaleCalfLinearParser and
WhaleCalfLinearAutomatonParser.

It suffices to create a parser object, use the method read() to pass the input
string to the parser, and then either call the method recognize() to determine
whether the string is in the language, or the method parse() to construct the
parse tree of the string.

The string is supplied to the method read() either as a pair of pointers (or
STL iterators) to integers, which mark the beginning and the end of the string,
or as an std::vector<int>. In both cases the terminal symbols are encoded
as numbers, enumerated starting from zero in the order of their definition: for
instance, if the declaration

terminal a, b, c;

is used, then the numbers 0, 1 and 2 correspond to the symbols a, b and c
respectively.
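The numbering convention can be mimicked in a few lines of Python. This is only an illustration of the encoding rule stated above; the helper name `encode` is hypothetical and is not part of the Whale Calf library.

```python
def encode(word, terminals):
    """Map each symbol to its zero-based position in the terminal declaration."""
    codes = {symbol: number for number, symbol in enumerate(terminals)}
    return [codes[symbol] for symbol in word]


print(encode("abcab", ["a", "b", "c"]))  # [0, 1, 2, 0, 1]
```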

The method recognize() takes no arguments and returns a value of type
bool, which is true if and only if the string is in the language. The method
parse() also takes no arguments and returns a pointer to the top node of the
constructed tree if the input string is in the language, or a NULL pointer otherwise.
The nodes of the tree are represented in memory using the following predefined
data structure:

struct TreeNode
{
    int rn; // rule number
    int tn; // terminal number
    int position; // of a terminal in the input string
    std::vector<std::vector<TreeNode *> > descendants;
};



For internal nodes, the data member rn contains the number of the rule

$$A \to s_{0,0} \ldots s_{0,l_0-1} \& \ldots \& s_{m-1,0} \ldots s_{m-1,l_{m-1}-1} \quad (m \geq 1,\ l_i \geq 0,\ s_{ij} \in \Sigma \cup N)$$

used at this node, tn equals -1, and descendants is a vector of size $m$, such
that every descendants[i] ($0 \leq i < m$) is a vector of pointers of size $l_i$, and for
every $j$ ($0 \leq j < l_i$) the element descendants[i][j] contains a pointer to the
descendant of the current node corresponding to the symbol $s_{ij}$.

For the leaves, rn is -1, while tn holds the number of the terminal symbol, 
position gives the position of this terminal in the input string and the vector 
of descendants is empty. 
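To show how such a tree might be traversed, here is a Python analogue of the node structure together with a function that collects the leaves in left-to-right order. It is a sketch under stated assumptions: the Python class only mirrors the fields described above, positions are taken to be zero-based, and the traversal follows the first conjunct of every internal node on the assumption that all conjuncts of a node cover the same substring. None of this is the library's own API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    rn: int = -1          # rule number, -1 for leaves
    tn: int = -1          # terminal number, -1 for internal nodes
    position: int = -1    # position of a terminal in the input string
    descendants: List[List["TreeNode"]] = field(default_factory=list)


def tree_yield(node):
    """Return the (position, terminal number) pairs of the leaves under node."""
    if node.rn == -1:
        return [(node.position, node.tn)]
    result = []
    # Follow the first conjunct only; the others are assumed to cover the
    # same part of the input.
    for child in node.descendants[0]:
        result.extend(tree_yield(child))
    return result


leaf_a = TreeNode(tn=0, position=0)
leaf_b = TreeNode(tn=1, position=1)
root = TreeNode(rn=0, descendants=[[leaf_a, leaf_b], [leaf_a, leaf_b]])
print(tree_yield(root))  # [(0, 0), (1, 1)]
```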

The creation of parse trees is only supported in the LL, LR and Tabular
parsing algorithms. In case of LL and LR, using the generated program in the
parser mode slows down its execution by a constant factor. For the Tabular
algorithm, the construction of a parse tree is preceded by ordinary recognition;
as noted in Section 2.2, this construction works efficiently for most grammars,
while for some grammars it can work very slowly or even fail to terminate.

The following example uses the parser given in Section 3.1 to construct the
derivation tree of abcab according to the grammar for $\{wcw \mid w \in \{a, b\}^*\}$.

#include "grammar.h" // The generated header file.

WhaleCalfLRParser whale_calf(Grammar::grammar);
int x[]={0, 1, 2, 0, 1}; // Let the input string be "abcab".
whale_calf.read(x, x+5); // Any pair of STL iterators can come here.
WhaleCalf::TreeNode *tree=whale_calf.parse();
if(tree) whale_calf.print_parse_tree(tree, ofstream("parse_tree.dot"));

The generated parsers are capable of exporting some of their static and dy- 
namic data structures, such as LR and LL parsing tables, tables constructed 
by the tabular algorithm, contents of tree-structured or graph-structured push- 
down, etc. Tables are printed as sources, while graphs are printed as text- 
based graph descriptions for consequent processing by the dot application from 
Graphviz graph drawing software PQ, which converts them directly to eps figures. 

The constructed parse trees can also be exported in dot format. For instance,
the tree produced by the parser from the example above and converted to eps
format is given in unaltered form in Figure 1.




Fig. 1. A parse tree constructed by LR parser ($abcab \in \{wcw \mid w \in \{a, b\}^*\}$).

4 Conclusion

The present version of the program implements all existing parsing algorithms 
for conjunctive grammars and provides a primitive programmer’s interface to 
these algorithms. Although this interface lacks flexibility required for industrial 
programming, its simplicity can be valuable for a scientist interested in the very 
concept of conjunctive grammars. 

The software may be freely used for educational and research purposes and 
is available from 

http://www.cs.queensu.ca/home/okhotin/whalecalf/

In future versions it is planned to enhance the programmer's interface
of the generated parsers and thus make the program usable for practical parser
generation. Although it cannot possibly rival context-free LALR(1) parser
generators in terms of the speed of the generated parsers, it is likely to be useful
for the applications where general context-free parsing methods are now being
used, because the algorithms implemented in Whale Calf have basically the same
computational complexity, but are applicable to a wider class of languages and
allow the use of a convenient logical operation of intersection.

It is also planned to implement the results of ongoing research on a further 
extension of conjunctive grammars with negation.

Abstract. We present a system that performs computations on finite
state machines, syntactic semigroups, and one-dimensional cellular
automata.



1 Sample Computations 

The automata system facilitates computation on finite state machines, syntactic 
semigroups, and one-dimensional cellular automata. Unlike some other systems 
such as Grail, AUTOMATE, Amore, see |S] for detailed references, the automata 
package is a hybrid system that is built around a commercial computer algebra 
system. Specifically, the current implementation uses version 4.1 of Mathematica 
by Wolfram Research, Inc., see HD. Before commenting more on this approach, 
we present two typical sample sessions. Fairly detailed descriptions of earlier 
versions of the package can be found in m- 

1.1 Entropy of Sofic Shifts 

Suppose you wish to determine the entropy of the sofic subshift associated with 
a particular one-dimensional cellular automaton, see [51 Ij for more background 
information). Here is a short session in automata that shows the necessary cal- 
culations. The dialogue is captured the way it would appear in the plain text 
interface. For more elaborate examples using the notebook frontend, see my 
home page http://www.cs.cmu.edu/~sutner. The first command converts the 
elementary cellular automaton number 92 into a de Bruijn semiautomaton, which
is then converted into the corresponding minimal Fischer automaton, see [mo]. 
We extract the transition matrix from the latter, construed as a non-negative
integer matrix, and determine its Perron eigenvalue.

sa = ToSA[ CA[ 92, 3, 2 ] ];
mf = MinimalFischerFA[ sa ]

SA[ 8, 2, {{1, 1, 1}, {2, 1, 4}, {3, 1, 1}, {4, 1, 6},
{5, 1, 1}, {6, 1, 6}, {7, 1, 8}, {1, 2, 2}, {2, 2, 3},
{3, 2, 5}, {4, 2, 2}, {6, 2, 7}, {7, 2, 5}, {8, 2, 2}} ]

M = FullTransitionMatrixFA[ mf ];
Log[ 2, Max[ Abs[ N[ Eigenvalues[M] ] ] ] ]

0.900537



In the formatting mode chosen here the semiautomaton mf is shown in abbre- 
viated form. There are two invisible fields indicating the actual state set (a set 
of size 8 according to the first field in the automaton, see section below) , and 
the elements of the alphabet (a set of size 2 according to the second field in the 
automaton). The first three commands use operations defined in the package, 
whereas computation of the eigenvalues is handled entirely by Mathematica. 
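The last step of the session, taking the base 2 logarithm of the largest eigenvalue modulus of a non-negative integer matrix, can be sketched outside Mathematica as well. The Python fragment below uses plain power iteration instead of a full eigenvalue computation, which is an assumption of this example: it is adequate for primitive matrices such as the small one shown, but it may fail to converge for periodic ones. The function names are hypothetical, and the matrix used is the well-known golden mean shift rather than the matrix of the session above.

```python
import math


def perron_eigenvalue(matrix, iterations=200):
    """Estimate the largest eigenvalue of a non-negative matrix by power iteration."""
    n = len(matrix)
    vector = [1.0] * n
    estimate = 0.0
    for _ in range(iterations):
        image = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        estimate = max(image)
        if estimate == 0:
            return 0.0
        vector = [x / estimate for x in image]
    return estimate


def entropy(matrix):
    """Base 2 logarithm of the estimated Perron eigenvalue."""
    return math.log2(perron_eigenvalue(matrix))


# Golden mean shift: its entropy is log2 of the golden ratio.
GOLDEN_MEAN = [[1, 1], [1, 0]]
print(round(entropy(GOLDEN_MEAN), 4))  # 0.6942
```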



1.2 Preserving Regularity 

As a second example, consider regularity preserving operations on regular
languages. There is a family of such operations based on existential quantification
over strings of a certain length. In particular, a function $f : \mathbb{N} \to \mathbb{N}$ is regularity
preserving if for any regular language $L$ the language

$$T(L, f) = \{x \in \Sigma^* \mid \exists y \in \Sigma^* (|y| = f(|x|) \wedge xy \in L)\}$$

is again regular. For specific functions $f$, the construction of the corresponding
machines can be expressed easily in terms of Boolean matrices, see also [4]. For
example, consider the function $f(i) = 2^i$. Here is the construction of a DFA
for $T(L, f)$ where $L$ is the language of all strings over $\{a, b\}$ whose length is
divisible by 3. We construct a DFA for $L$ by hand and define symbol B to be the
natural homomorphism $\beta : \Sigma^* \to \mathbb{B}^{Q \times Q}$ from words to Boolean matrices of size
$Q \times Q$. The ability to use higher type objects in this effortless way is a significant
advantage in actual interactive, experimental computations. Of course, in this
particular case this is a bit of overkill: the Boolean sum $\beta(a) + \beta(b)$ is a simple
circulant matrix here.

m = DFA[ 3, 2, {{2, 3, 1}, {2, 3, 1}}, 1, {1} ]

DFA[3, 2, {{2, 3, 1}, {2, 3, 1}}, 1, {1}]

TransitionMatrixFA[ m, B, Type->Boolean ];
M = BooleanUnion[ B[a], B[b] ]

{{0, 1, 0}, {0, 0, 1}, {1, 0, 0}}



We can now define an action dot that turns the set $Q'$ of pairs $(p, A)$, with 
$p$ a state and $A$ a Boolean matrix of size $Q \times Q$, into a semimodule 
over $\Sigma^*$. First, the transition function of the DFA is assigned to the symbol 
delta. Operation dot then applies delta to the first component of any pair 
$(p, A) \in Q'$, and squares the Boolean matrix. The sub-semimodule generated by 
$(q_0, M)$ provides the state set for the DFA mm recognizing $T(L,f)$. During the 
generation of the sub-semimodule we also produce the transition function for mm. 
Lastly, the final states $(p, A)$ can be determined by the condition that 
$I_p \cdot A \cdot I_F$ not be null, where $I_p$ and $I_F$ are the bit vectors of 
the state $p$ and of the final states. 



TransitionFunctionFA[ m, delta ];
F = ToBitVector[ Final[m], Range[3] ];
dot[{p_, P_}, s_] :=
    { delta[p,s], BooleanComposition[P, P] };
final[{p_, P_}] := ToBitVector[p, Range[3]] . P . F > 0;

{Q,W,mm} = GenerateDFA[ dot, 2, final ];
mm

DFA[ 6, 2, {{2, 3, 4, 5, 6, 1}, {2, 3, 4, 5, 6, 1}},
     1, {2, 3} ]

W

{Eps, a, aa, aaa, aaaa, aaaaa}



The other fields Q and W contain the carrier set of the semimodule and a 
collection of corresponding witnesses, respectively. Either one could be used as 
the underlying state set of mm if need be. At any rate, the resulting machine 
has 6 states and a little arithmetic shows that $T(L,f)$ should consist of all 
words of length $i$ where $i \equiv 1, 2 \pmod 6$. We can verify this computationally 
by generating a few words in the language, or by evaluating its census function up 
to length 20. 

LanguageFA[ mm, -6 ]

LanguageFA[ mm, -20, SizeOnly->True ]

{{}, {a, b}, {aa, ab, ba, bb}, {}, {}, {}, {}}

{0, 2, 4, 0, 0, 0, 0, 128, 256, 0, 0, 0, 0, 8192, 16384,
 0, 0, 0, 0, 524288, 1048576}
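
The semimodule construction can be retraced in a few lines of Python. This is a sketch under stated assumptions rather than the package's implementation: states are numbered from 0 instead of 1, and the helper names `bool_mul`, `generate_dfa` and `census` are hypothetical. It closes the pair (initial state, M) under the action that steps the state and squares the matrix, then counts accepted words by length.

```python
def bool_mul(A, B):
    """Boolean matrix product, matrices given as tuples of 0/1 tuples."""
    n = len(A)
    return tuple(
        tuple(int(any(A[i][k] and B[k][j] for k in range(n))) for j in range(n))
        for i in range(n)
    )

def generate_dfa(start, letters, step, is_final):
    """Breadth-first closure of `start` under `step`."""
    states, index, table = [start], {start: 0}, []
    i = 0
    while i < len(states):
        row = []
        for s in letters:
            nxt = step(states[i], s)
            if nxt not in index:
                index[nxt] = len(states)
                states.append(nxt)
            row.append(index[nxt])
        table.append(row)
        i += 1
    finals = [k for k, q in enumerate(states) if is_final(q)]
    return states, table, finals

# DFA for L: words over {a, b} whose length is divisible by 3.
def delta(p, s):
    return (p + 1) % 3

FINAL = {0}
M = ((0, 1, 0), (0, 0, 1), (1, 0, 0))     # Boolean sum of the letter matrices

def dot(pair, s):
    p, P = pair
    return (delta(p, s), bool_mul(P, P))   # step the state, square the matrix

def final(pair):
    p, P = pair
    return any(P[p][q] for q in FINAL)     # row p of P meets a final state

def census(table, finals, max_len):
    """Number of accepted words of each length 0..max_len."""
    counts = [0] * len(table)
    counts[0] = 1
    result = []
    for _ in range(max_len + 1):
        result.append(sum(counts[q] for q in finals))
        nxt = [0] * len(table)
        for q, row in enumerate(table):
            for target in row:
                nxt[target] += counts[q]
        counts = nxt
    return result

states, table, finals = generate_dfa((0, M), "ab", dot, final)
print(len(states), finals, census(table, finals, 8))
```

The closure has six elements because the matrix component alternates between M and its square while the state component cycles with period 3.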



1.3 Syntactic Semigroups 

As a last example, we calculate the syntactic semigroup of a regular language. 
Consider $L = \{ x \in \{a, b\}^* \mid x_{-3} = a \}$, the set of all words having 
an a in the third position from the end. 



m = MinimizeFA[ IthSymbolFA[ a, -3 ] ];
{S,W,eq} = SyntacticSG[ m, Equations->True ];
S

T[2, 3, 5, 7, 5, 7, 3, 2], T[1, 4, 6, 8, 6, 8, 4, 1],
T[3, 5, 5, 3, 5, 3, 5, 3], T[4, 6, 6, 4, 6, 4, 6, 4],
T[2, 7, 7, 2, 7, 2, 7, 2], T[1, 8, 8, 1, 8, 1, 8, 1],
T[5, 5, 5, 5, 5, 5, 5, 5], T[6, 6, 6, 6, 6, 6, 6, 6],
T[7, 7, 7, 7, 7, 7, 7, 7], T[8, 8, 8, 8, 8, 8, 8, 8],
T[3, 3, 3, 3, 3, 3, 3, 3], T[4, 4, 4, 4, 4, 4, 4, 4],
T[2, 2, 2, 2, 2, 2, 2, 2], T[1, 1, 1, 1, 1, 1, 1, 1]]



The semigroup is presented as an explicit set of transformations, i.e., functions 
$[8] \to [8]$. Since we chose the option Equations->True, the operation also 
generates the canonical rewrite system for the semigroup, which turns out to 
consist of all directed equations of the form xyzu = yzu. There are 8 idempotents 
in the semigroup, and they happen to coincide with the right nulls. 



id = IdemSG[S] 

{ T[5, 5, 5, 5, 5, 5, 5, 5], T[6, 6, 6, 6, 6, 6, 6, 6], 
..., T[1, 1, 1, 1, 1, 1, 1, 1]} 

id == RightNullSG[S] 

True 

In a similar fashion we can generate the D-class decomposition of the semigroup. 
In the visual frontend, the classes can then be rendered in their natural 
two-dimensional representation, either using transformations or their witnesses. 
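
The size of the syntactic semigroup and the claim about idempotents and right nulls can be cross-checked independently. The sketch below is an illustration, not the SyntacticSG implementation: it assumes the minimal DFA for L is the 8-state machine that remembers the last three symbols read, numbers states in its own order (so the transformations differ from the T[...] listing by a renaming of states), and uses hypothetical helper names.

```python
from itertools import product

LETTERS = "ab"
# One state per window of the last three symbols read.
STATES = ["".join(w) for w in product(LETTERS, repeat=3)]
INDEX = {q: i for i, q in enumerate(STATES)}

def letter_map(c):
    """Transformation of the state set induced by reading the letter c."""
    return tuple(INDEX[q[1:] + c] for q in STATES)

def compose(f, g):
    """Transformation of the word uv, where f belongs to u and g to v."""
    return tuple(g[x] for x in f)

def generate_semigroup(generators):
    """Close the generators under right multiplication by generators."""
    elements = list(dict.fromkeys(generators))
    seen = set(elements)
    i = 0
    while i < len(elements):
        for g in generators:
            h = compose(elements[i], g)
            if h not in seen:
                seen.add(h)
                elements.append(h)
        i += 1
    return elements

S = generate_semigroup([letter_map(c) for c in LETTERS])
idempotents = [f for f in S if compose(f, f) == f]
right_nulls = [z for z in S if all(compose(s, z) == z for s in S)]
print(len(S), len(idempotents), set(idempotents) == set(right_nulls))
```

Words of length three or more erase the whole window and therefore act as constant maps, which is why the eight idempotents are exactly the right nulls.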

2 Experimentation, Prototyping, and Production Code 

One of the goals of the automata package is to demonstrate the feasibility of a 
computational environment that supports interactive computation, rapid prototyping 
of complicated algorithms, and the use of production scientific code. As a case in 
point, take the function MinimalFischerFA that was used in the entropy computation. 
Originally this function consisted of a short segment of Mathematica code, 
interactively developed and based on primitives provided by the package, whose 
sole purpose was to compute a few Fischer automata arising from some 
examples. The code was later collected into an experimental function available 
in the package, but not yet officially supported. In the last step, the function 
became a fully supported part of the package, complete with an implementation 
as external C++ code. It can now be used from within the package, as 
part of a C++ library, in the form of shell scripts, or as a command in an 
interactive calculator written entirely in C++. 

Similar comments apply to all the other crucial operations that tend to be 
computational bottlenecks: in our case, computation of a power automaton, min- 
imization, generation of syntactic semigroups, to name a few. All these opera- 
tions are implemented both in Mathematica and externally in C++. Note that 
this double implementation has some advantages with respect to checking cor- 
rectness: the implementation languages Mathematica and C++ are sufficiently 
different to make it unlikely that the same error would appear in both imple- 
mentations. Of course, there is no safeguard against structurally wrong imple- 
mentations in any language. As indicated in the semimodule computation above, 
some operations can optionally produce certificates that can be used to verify 
the correctness of the output. 

Considerable effort has gone into integrating the two components as tightly as 
possible. For example, the internal algorithms dealing with finite state machines 
support an option Normalize which allows the user to preserve the natural state 
set of a machine. Thus, in a power automaton construction the state set of the 
new machine is naturally a subset of pow(Q), where Q is the state set of the 
nondeterministic machine. In a product automaton, the state set is a subset of 
$Q_1 \times Q_2$ and minimization produces a partition of the state set of the given 
machine. By default the state set is always normalized to [n], but whenever 
necessary we can preserve the structure even during nested operations: 



m1 = ToDFA[ InfixFA[ aba ], Normalize->1 ];
MinimizeFA[ m1, Normalize->2 ] // States

{{{1}}, {{1, 2}}, {{1, 3}}, {{1, 2, 4}, {1, 3, 4}, {1, 4}}}

The same options are also available in the external code; indeed, nested lists 
of atoms (integers, strings, finite state machines, semigroups, cellular automata) 
are the basic data structure in the external code. 
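
To make the idea of preserved state sets concrete, here is a small Python sketch of a subset construction followed by Moore-style partition refinement, in which every state keeps its natural name. It is an illustration under assumptions, not the package's algorithm: the NFA for the infix aba is written out by hand with states 1 to 4, and the function names are hypothetical.

```python
def power_automaton(nfa, start, letters):
    """Reachable subset construction; states remain frozensets of NFA states."""
    first = frozenset(start)
    states, delta = [first], {}
    for S in states:                       # the list grows while we iterate
        for c in letters:
            T = frozenset(q for p in S for q in nfa.get((p, c), ()))
            delta[S, c] = T
            if T not in states:
                states.append(T)
    return states, delta

def minimize(states, delta, letters, accepting):
    """Moore refinement; returns the blocks as lists of original states."""
    block = {S: int(S in accepting) for S in states}
    while True:
        sig = {S: (block[S],) + tuple(block[delta[S, c]] for c in letters)
               for S in states}
        ids = {}
        refined = {S: ids.setdefault(sig[S], len(ids)) for S in states}
        if len(ids) == len(set(block.values())):
            break                           # no block was split: stable
        block = refined
    groups = {}
    for S in states:
        groups.setdefault(block[S], []).append(S)
    return list(groups.values())

# Hand-written NFA accepting all words over {a, b} that contain aba.
nfa = {(1, "a"): {1, 2}, (1, "b"): {1}, (2, "b"): {3},
       (3, "a"): {4}, (4, "a"): {4}, (4, "b"): {4}}
states, delta = power_automaton(nfa, {1}, "ab")
accepting = {S for S in states if 4 in S}
blocks = minimize(states, delta, "ab", accepting)
print([sorted(sorted(S) for S in block) for block in blocks])
```

The blocks are sets of sets of NFA states, which is the same two-level nesting that the session output above displays.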

Another important point is the quality of the frontend. In principle, modern 
user-interface tools make it feasible to construct relatively sophisticated front- 
ends from scratch. However, these front-ends inevitably lack versatility, and often 
are indeed limited to a narrow collection of operations. The visual presentation 
of mathematical information is a challenging task, and one should not expect 
satisfactory solutions from ad-hoc efforts. The Mathematica frontend on the 
other hand produces near-publication quality results, and provides easy access 
to a large array of operations. When the data produced by the core algorithms 
of the system are complicated in nature (e.g., the D-class decomposition of 
a semigroup), the notebook frontend greatly helps to display, manipulate and 
further analyze the data, conveniently within the whole system. 

For research applications yet another aspect of considerable importance is the 
archiving of data. Often the actual calculation of the data requires a relatively 
small program based on the machinery in the system. Ideally, the program, ac- 
companying text, the actual data, and their analysis should all be bundled in a 
single unit. Dealing with collections of separate files becomes quickly unwieldy, 
and often forces tedious and time-consuming recomputation. A coherent, plat- 
form independent interface such as a Mathematica notebook addresses all these 
issues. 

Efforts are currently under way to develop an XML standard for the represen- 
tation and exchange of finite state machines, regular expressions, and transfor- 
mation semigroups. The purpose of the standard is two-fold: first of all, it allows 
communication between software systems that are currently isolated from each 
other. For example, it should be straightforward to generate a finite state ma- 
chine on one system, pipe the output to another to have it minimized there, 
and then have a third system generate the corresponding syntactic semigroup. 
Secondly, representing mathematical objects in XML and building on existing 
standards such as OpenMath, one can easily communicate these objects over the 
internet, so that the three systems cooperating in the computation of the 
semigroup might well be located on three separate machines. The most recent versions 
of web browsers are capable of rendering a good amount of mathematics, hence the 
results of the computation can be observed and controlled in a browser. The 
latest version of Mathematica offers a lot of XML support and makes it fairly 
easy to generate and convert XML based material. A prototype of the DTD and 
sample conversion routines between the XML format and the automata format 
are available at the website below. 

Needless to say, maintenance of a hybrid system poses significant challenges. 
The latest version of automata uses XML as the sole repository for the Math- 
ematica code. The various components are automatically assembled into a so- 
called add-on package via XSL style-sheets. This bundling process produces a 
software package that can simply be deposited in the user’s home directory, 
and that will then load automatically whenever a command in the package is 
used. Short help on a per-function basis is available from the notebook 
interface, and there is a large collection of notebooks that demonstrate the use of the 
package. The latest version is fully integrated with the extensible Mathematica 
help-browser and provides interactive help for all the commands in the package. 

As far as the external code is concerned, it is the usual collection of header-, 
implementation- and make-files. The STL is used as the main source for standard 
data structures, see [7,3]. If desired, the user can provide memory managers to 
speed up the external code (the standard memory manager is taken from the 
STL). Apart from the nested lists of atoms the external code tries to avoid 
inheritance in favor of parametrized types; thus it is relatively easy to read, 
extend, and modify. 

The package has been brought to bear on a number of problems that might 
otherwise well have proven intractable, see [11,12,10,13]. The code is available 
at http://www.cs.cmu.edu/~sutner 

Abstract. We implement a set of procedures for deciding whether or 
not a language given by its minimal automaton or by its syntactic 
semigroup is locally testable, right or left locally testable, threshold locally 
testable, strictly locally testable, or piecewise testable. The bounds on the 
order of local testability of the transition graph and of the transition 
semigroup are also found. For given k, the k-testability of the 
transition graph is verified. Some new effective polynomial time 
algorithms are used. These algorithms have been implemented as a C 
package. 

1 Introduction 

Locally testable and piecewise testable languages with generalizations are the 
best known subclasses of star-free languages with a wide spectrum of applications. 

Membership of a long text in a locally testable language just depends on a 
scan of short subpatterns of the text. It is best understood in terms of a kind of 
computational procedure used to classify a two-dimensional image: a window of 
relatively small size is moved around on the image and a record is made of the 
various attributes of the image that are detected by what is observed through 
the window. No record is kept of the order in which the attributes are observed, 
where each attribute occurs, or how many times it occurs. We say that a class 
of images is locally testable if a decision about whether a given image belongs 
to the class can be made simply on the basis of the set of attributes that occur. 

Kim, McNaughton and McCloskey have found necessary and sufficient 
conditions of local testability for the state transition graph $\Gamma$ of a 
deterministic finite automaton [0]. By considering the cartesian product 
$\Gamma \times \Gamma$, we modify these necessary and sufficient conditions; the 
algorithms used in the package are based on this approach. 

The locally threshold testable languages were introduced by Beauquier and 
Pin [1]. These languages generalize the concept of locally testable language and 
have been studied extensively in recent years. 

Right [left] local testability was introduced and studied by Konig [11] and 
by Garcia and Ruiz [S]. These papers use different definitions of the notion, 
and we follow [S] here: 

A finite semigroup S is right [left] locally testable iff it is locally idempotent 
and locally satisfies the identity xyx = xy [xyx = yx]. 

We introduce polynomial time algorithms for the right [left] local testability 
problem for the transition graph and the transition semigroup of a deterministic 
finite automaton. A polynomial time algorithm also verifies the transition graph 
of an automaton with locally idempotent transition semigroup. 
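
The definition of right [left] local testability can be checked directly on a multiplication table. The following Python sketch is a brute-force reading of that definition and is not the algorithm implemented in the package: "locally" is taken to mean that every subsemigroup eSe, for an idempotent e, satisfies the identity, and the function names and the table encoding (`table[x][y]` is the product xy) are assumptions made for illustration.

```python
def local_identity_holds(table, identity):
    """Check idempotency and `identity` inside eSe for every idempotent e."""
    n = len(table)

    def mul(x, y):
        return table[x][y]

    for e in range(n):
        if mul(e, e) != e:
            continue                       # only idempotents define eSe
        local = {mul(mul(e, s), e) for s in range(n)}
        for x in local:
            if mul(x, x) != x:             # eSe must be idempotent
                return False
            for y in local:
                if not identity(mul, x, y):
                    return False
    return True

def is_right_locally_testable(table):
    # local identity xyx = xy
    return local_identity_holds(
        table, lambda mul, x, y: mul(mul(x, y), x) == mul(x, y))

def is_left_locally_testable(table):
    # local identity xyx = yx
    return local_identity_holds(
        table, lambda mul, x, y: mul(mul(x, y), x) == mul(y, x))

# Elements 0 (identity), 1 and 2, where 1 and 2 form a left-zero band.
left_zero_with_identity = [[0, 1, 2], [1, 1, 1], [2, 2, 2]]
cyclic_group_z2 = [[0, 1], [1, 0]]

print(is_right_locally_testable(left_zero_with_identity),
      is_left_locally_testable(left_zero_with_identity),
      is_right_locally_testable(cyclic_group_z2))
```

The first example satisfies xyx = xy but not xyx = yx, so the two notions really differ; the group fails already because it is not locally idempotent.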

There are several systems for manipulating automata and semigroups. The 
list of these systems is following 0 and preprint of [3: 

REGPACK [12] AUTOMATE |5] AMoRE [13| Grail |l7] The FIRE Engine 
j l27l LANG AGE Ej. APL package Ej. Froidure and Pin package jZj. Sutner pack- 
age Whale Galf dS]. 

Some algorithms concerning distinct kinds of testability of finite automata 
were implemented by Garon m, H- His programs verify piecewise testable, lo- 
cally testable, strictly and strongly locally testable languages. 

In our package TESTAS (testability of automata and semigroups), the range 
of implemented algorithms is substantially extended. We consider the important and 
highly complicated case of locally threshold testable languages [25]. The 
transition semigroups of automata are studied in our package for the first time [22]. 
Some algorithms (polynomial and even in some way non-polynomial) check the 
order of local testability [211 • We implement a new efficient algorithm for piece- 
wise testability improving the time complexity from 0(n®) [3|, [TU] to 0{v?) 
PS| . We consider algorithms for right local testability time and space 

complexity), for left local testability (O(n^) time and space complexity) and the 
corresponding algorithms for transition semigroups {0{rP‘) time and space com- 
plexity). The graphs of automata with locally idempotent transition semigroup 
are checked too (O(n^) time complexity). All algorithms dealing with transition 
semigroup of automaton have 0{n?) space complexity. 



2 Algorithms Used in the Package 

Let the integer a denote the size of alphabet and let g be the number of nodes. 
By n let us denote here the size of the semigroup. 

The syntactic characterization of locally threshold testable languages was 
given by Beauquier and Pin [1]. From their result follow necessary and sufficient 
conditions of local threshold testability for the transition graph of a DFA [25]; 
our package uses the resulting polynomial time algorithm for the local threshold 
testability problem for the transition graph and for the transition semigroup of 
the language. 

Let us notice here that the algorithm for transition graph from m im) 
is valid only for complete graph. Of course, the general case can be reduced to 
the case of complete graph by adding of a sink state. Let us notice also another 
error from [2S| (jlS]): in the Theorem 16 (17) in the list of the conditions of local 
threshold testability, the property that any Tscc is well defined is missed. 

The time complexity of the graph algorithm for local threshold testability is 
0{ag^). The algorithm is based on consideration of the graphs and F^ and 
therefore has 0{ag^) space complexity. The time complexity of the semigroup 
algorithm is $O(n^2)$. 

Polynomial time algorithms for the local testability problem for semigroups 
of order O(n^) and for graphs [25] of order 0{ag^) are implemented in the 
package too. We use in our package a polynomial time algorithm of worst case 
asymptotic cost 0{ag'^) for finding the bounds on order of local testability for a 
given transition graph of the automaton |24| and a polynomial time algorithm of 
worst case asymptotic cost 0{ag^) for checking the 2-testability |^. Checking 
the /c-test ability for fixed k is polynomial but growing with k. For checking the 
/c-testability m, we use an algorithm of worst case asymptotic cost 0{g^a^ ^). 
The order of the last algorithm is growing with k and so we have non-polynomial 
algorithm for finding the order of local testability. The algorithms are based 
on consideration of the graph and have 0{ag^) space complexity. The 1- 
testability is verified by help of algorithm of cost 0{a?g). 

The situation in semigroups is more favorable than in graphs. We implement 
in our package a polynomial time algorithm of worst case asymptotic cost O(n^) 
for finding the order of local testability for a given semigroup |22]. The class of 
locally testable semigroups coincides with the class of strictly locally testable 
semigroups |23], whence the same algorithm of cost O(n^) checks strictly locally 
testable semigroups. 

Stern mi modified necessary and sufficient conditions of piecewise testabil- 
ity of DFA (Simon |18) ') and described a polynomial time algorithm to verify 
piecewise testability. 

We use in our package a polynomial time algorithm to verify piecewise testa- 
bility of deterministic finite automaton of worst case asymptotic cost 0{ag^) 
p5j . In comparison, the complexity of Stern’s algorithm [19] is 0{ag^). Our al- 
gorithm uses 0{ag^) space. We implement also an algorithm to verify piecewise 
testability of a finite semigroup of cost 0{n?) 

3 Description of the Package TESTAS 

The package includes programs that analyze: 

1) an automaton of the language presented as oriented labeled graph; 

2) an automaton of the language presented by its syntactic semigroup, 

and find 

3) the direct product of two semigroups or of two graphs, 

4) the syntactic semigroup of an automaton presented by its transition graph. 

The first two programs are written in C/C++ and can be used in a WINDOWS 
environment. The input file may be an ordinary txt file. We open the source file with 
the transition graph or transition semigroup of the automaton in the standard way 
and then check different properties of the automaton from the menu bar. Both graph 
and semigroup are presented on the display as a rectangular table. 

The first two numbers in the input graph file are the size of the alphabet and the 
number of nodes. The transition graph of the automaton is presented by the matrix: 

nodes X labels 

where the nodes are presented by integers from 0 to n-1. The i-th line of the matrix 
is the list of successors of the i-th node, one for each label. The (i,j) cell contains 
the number of the end node of the edge that begins in the i-th node and carries the 
j-th label. The number of nodes, the size of the alphabet of edge labels and the 
values in the matrix can be changed. 

The input of the semigroup algorithms is the Cayley graph of the semigroup 
presented by the matrix: 

elements X generators 

where the elements of the semigroup are presented by integers from 0 to n - 1 
with the semigroup generators at the beginning. The i-th line of the matrix is the 
list of products of the i-th element by all generators. 

The set of generators is not necessarily minimal, therefore the multiplication table 
of the semigroup (Cayley table) is acceptable too. Comments without numerals 
may be placed in the input file as well. 

The program checks local testability, local threshold testability and piecewise 
testability of the syntactic semigroup of the language. Strictly locally testable 
and strongly locally testable semigroups are verified as well. The level of local 
testability of the syntactic semigroup is also found. Aperiodicity and the associative 
law can be checked too. The values of products in the matrix of the Cayley graph 
can be changed. 

The checking of the algorithms is based in particular on the fact that the 
considered objects belong to a variety and therefore are closed under direct product. 
Two auxiliary programs written in C that find the direct product of two semigroups 
and of two graphs belong to the package. The input of the semigroup program 
consists of two semigroups presented by their Cayley graphs with generators at the 
beginning of the element list. The result is presented in the same form and the 
set of generators of the result is placed at the beginning of the list of elements. 
The number of generators of the result is $n_1 g_2 + n_2 g_1 - g_1 g_2$, where $n_i$ 
is the size of the i-th semigroup and $g_i$ is the number of its generators. The 
components of the direct product of graphs are considered as graphs with a common 
alphabet of edge labels. The labels of both graphs are identified according to their 
order. The number of labels is not necessarily the same for both graphs, but the 
resulting alphabet uses only the common labels from the beginning of both alphabets. 
Large semigroups and graphs can be obtained with the help of these programs. 
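
The direct product of two transition graphs, restricted to their common labels, can be sketched as follows. This is an illustrative Python fragment, not the package's C program: it assumes both graphs are complete and given as nodes-by-labels tables, and the function name and the numbering of the pair (p, q) as p * n2 + q are choices made here.

```python
def direct_product(g1, g2):
    """Direct product of two complete transition tables (rows = nodes).

    Only the labels common to both graphs, counted from the beginning of
    their alphabets, are kept. The pair (p, q) becomes node p * len(g2) + q.
    """
    labels = min(len(g1[0]), len(g2[0]))
    n2 = len(g2)
    return [
        [g1[p][c] * n2 + g2[q][c] for c in range(labels)]
        for p in range(len(g1))
        for q in range(n2)
    ]

# A 2-node graph with two labels and a 3-node cycle with one label.
g1 = [[1, 0], [0, 1]]
g2 = [[1], [2], [0]]
print(direct_product(g1, g2))
```

Repeated application yields large test inputs quickly, since the number of nodes multiplies at each step.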

An important verification tool of the package is the possibility to study both 
the transition graph and the semigroup of an automaton. The program written in C 
finds the syntactic semigroup from the transition graph of the automaton. 

The maximal size of the semigroups we considered on a standard PC was several 
thousand elements. The maximal size of the considered graphs was several hundred 
nodes. In such cases the program uses memory on the hard disc and runs for some 
minutes. 

Abstract. This paper shows a comparison of two data structures used 
for indexing of input texts. The first structure is the Suffix Array and 
the second is the Directed Acyclic Word Graph (DAWG). We present 
an efficient DAWG implementation. This implementation is compared 
with other structures used for text indexing. The construction time and 
speed of searching of a set of substrings are shown for the DAWG and 
the Suffix Array. 



1 Introduction 

Indexing structures that support pattern matching in time linear in the 
length of the pattern are constructed for static texts. Although some indexing 
structures have a linear size with respect to the length of the text, this size is 
high enough to prevent practical implementation and usage. For the suffix tree 
the size is rarely smaller than 10n bytes, where n is the length of the text. Other 
structures are a Directed Acyclic Word Graph (DAWG) (size about 30n bytes) 
and its compact version CDAWG (size about 10n bytes). Other types of indexing 
structures are usually smaller: suffix arrays [6] (size 5n bytes), level compressed 
tries [1] (size about 11n bytes), suffix cactuses - a combination of suffix trees and 
suffix arrays [10] (size 9n bytes), and suffix binary search trees (size about 10n 
bytes). 

We have used a set of 31 files in the experiments. 17 files are taken from the 
Calgary Corpus and 14 files from the Canterbury Corpus [5]. Both corpora are 
widely used to compare lossless data compression programs. These files are also 
used in m, and therefore our results can be compared to the previous work. 

2 Basic Definitions 

An alphabet is a finite set of symbols. A string over a given alphabet is a finite 
sequence of symbols. Let $T = t_1 t_2 \ldots t_n$ be a string (text) over a given 
alphabet $A$. A pattern $P$ of length $m$ is a substring of a text $T$ if 
$P = t_i t_{i+1} \ldots t_{i+m-1}$ for some $i$. A pattern $P = t_i t_{i+1} \ldots t_n$ 
is called the $i$-th suffix of the text. To determine whether a 
pattern P is a substring (subpattern, subword, factor) of a text T is the pattern 
matching problem. DAWG(T) is a minimal automaton that accepts all suffixes 
of a text T. The second structure to be compared is a suffix array. A suffix array 
is an array of pointers to the text. Each pointer represents one suffix of the text 
by its beginning in the text. These pointers are ordered in increasing lexicographical 
order of the suffixes they point to. 

* This research has been partially supported by the Ministry of Education, Youth, 
and Sports of Czech Republic under research program No 304/98:212300014 and by 
the Grant Agency of Czech Republic under research program No 102/01/1433. 



3 Construction of DAWG and Suffix Array 

There are many ways of constructing DAWG from text. More details can be 
found for example in [^. The method used here is the on-line construction al- 
gorithm. An example of DAWG constructed using this algorithm for an input 
text T = acagac is shown in Fig. 1. 




Fig. 1. DAWG (acagac) and Suffix array (acagac). 



An implementation of DAWG presented in this paper uses a compression 
of elements of the graph that represents the automaton to decrease space 
requirements. The 'compression' is not a compression of the whole data structure, 
which would mean performing decompression to be able to work with it, but it is 
a compression of individual elements, so it is necessary to decompress only those 
elements that are needed during a specific search. This method is applicable 
to all homogeneous automata^1 and it can be generalized to all automata that 
accept a finite set of strings and to all structures which can be drawn as an 
acyclic graph. 

The whole graph is a sequence of bits in memory that can be referenced 
by pointers. A vertex is a position in the bit stream where a sequence of edges 
originating from the vertex begins. These edges are pointers into the bit stream. 
They point to the memory cells where the corresponding terminal vertices are 
located. Vertices are stored in a topological order^2, which ensures that a search 
for a pattern is a one-way pass through the implementation of the graph 
structure. 

^1 Homogeneous automata have all transitions to a specific state labelled with the 
same symbol. 

Each vertex contains information about the labels of all edges that lead to it 
and the number of edges that start from it. Since it is possible to construct a 
statistical distribution of all symbols in the text, we can store edge labels using 
a Huffman code [8]. We also use a Huffman code to encode the number of edges 
that start from a vertex. The most frequent case of a node having only one 
outgoing edge is then dealt with as a special case. 

The last part of the vertex contains references to vertices that can be accessed 
from the current vertex. These references are realized as relative addresses with 
respect to the beginning of the next element. 

We will use an address consisting of two parts: the first part (of length s) 
determines the number of bits of the second part, and the second part gives 
the distance to the ending vertex in bits. The best results are obtained for s = 3. 
This value is sufficient for a wide range of input text file lengths, which guarantees 
a simple implementation. This observation is based on experimental evaluation. 
Suppose s = 3, t = 32. This encoding regularly divides address codes into eight 
categories in steps of four bits. The longest address is 32 bits long. It corresponds 
to a maximal text size of about 100 MB. 
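
The two-part address can be illustrated with a short Python sketch for s = 3 and t = 32. It is not the authors' encoder: the mapping of category c to a payload of 4(c + 1) bits, the most-significant-bit-first layout, and the function names are assumptions chosen to match the description of eight categories in steps of four bits.

```python
S_BITS = 3                       # length of the category part (s)
MAX_BITS = 32                    # longest payload (t)
STEP = MAX_BITS // 2 ** S_BITS   # four bits per category step

def encode_address(distance):
    """Return the category bits followed by the shortest fitting payload."""
    for category in range(2 ** S_BITS):
        width = STEP * (category + 1)
        if distance < 2 ** width:
            return format(category, f"0{S_BITS}b") + format(distance, f"0{width}b")
    raise ValueError("distance does not fit into 32 bits")

def decode_address(bits, pos=0):
    """Decode one address starting at `pos`; return (distance, next position)."""
    category = int(bits[pos:pos + S_BITS], 2)
    width = STEP * (category + 1)
    start = pos + S_BITS
    return int(bits[start:start + width], 2), start + width

stream = encode_address(5) + encode_address(300)
first, pos = decode_address(stream)
second, pos = decode_address(stream, pos)
print(encode_address(5), first, second)
```

Short distances cost 7 bits and the longest ones 35 bits, so the scheme pays off when most edges point to nearby vertices.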

The approach presented here creates a DAWG structure in three phases. 
The first phase is the construction of the usual DAWG, the second phase is 
the topological ordering (or re-ordering) of vertices, which ensures that no edge 
has a negative "length", where the length is measured as a difference of vertex 
numbers. The final phase is the encoding and storing of the resulting structure. 
The first phase uses an on-line construction algorithm and accounts for the peak 
space usage in the construction of the DAWG. This creates a large space overhead. 
This overhead depends on the type of the text, but the whole working space can be 
bounded by 64 * n, where n is the size of the text. 

In the suffix array, suffixes of the text are represented by the positions of their 
beginnings in the text. The i-th field of the suffix array is initialized with the 
number i. All suffixes of the text are represented in the suffix array: the longest 
suffix is represented by the number 1 and the shortest suffix by the 
number n. The construction of the suffix array consists of sorting the initialized 
suffix array. The comparison of two fields of the suffix array compares the 
corresponding suffixes, and its result depends on their lexicographical 
order. The resulting suffix array represents the sorted list of suffixes of the text, 
on which binary searches are performed. An example of a suffix array constructed for 
an input text T = acagac is shown in Fig. 1. 
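
The construction and the binary search can be sketched in Python for the example text. This is a naive illustration, not the implementation measured in the paper: positions are 0-based rather than 1-based, sorting compares whole suffixes directly, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of `pattern`, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                          # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                          # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "acagac"
sa = build_suffix_array(text)
print(sa, find_occurrences(text, sa, "ac"))
```

Each search step compares at most m symbols, so a query costs on the order of m log n symbol comparisons, compared with the single pass of length m through a DAWG.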



^2 A topological ordering of a graph is a numbering of vertices that ensures that each 
edge starts from a vertex with a lower number and ends at a vertex with a higher 
number. 

Table 1. Relative space requirements (in bytes per input symbol) of DAWG and 
other structures. File: the name of the testing file; Source: CL - Calgary Corpus, 
CN - Canterbury Corpus; Length: the size of the text; |A|: the size of the input 
alphabet. The columns DAWG, CDAWG, and S.Tree cite the results of the implementation 
published by Kurtz in dl and show the relative space requirement; CDAWG and suffix 
tree use the original text for string matching, but the length of the text is not 
included in the shown ratio. S.Array: the lower bound of the size of a suffix array 
using the shortest pointers; all pointers have the same size and this size is 
$\lceil \log_2(n) \rceil$ bits. DAWG.1 corresponds to the addresses that use eight 
categories (three bits per category code). These categories regularly divide possible 
lengths from the length 0 bits to the maximal size of 32 bits. The first category 
corresponds to addresses 4 bits long, the second corresponds to addresses 8 bits 
long, etc. This means that this address encoding is sufficient for texts smaller than 
approximately 100 MB. Some categories are not used if the DAWG.1 is created for a 
small text. DAWG.2 is obtained for a code with eight categories, where the first 
category denotes the address of length zero, i.e. no address is stored and the 
referred element leads to the next processed element. The next categories regularly 
divide possible lengths from 4 bits to the maximal size. 



File      Source   Length  |A|    DAWG  CDAWG  S.Tree  S.Array  DAWG.1  DAWG.2
book1     CL       768771   81   30.35  15.75    9.83     3.50    3.75    3.66
book2     CL       610856   96   29.78  12.71    9.67     3.50    3.25    3.17
paper1    CL        53161   95   30.02  12.72    9.82     3.00    3.09    2.98
paper2    CL        82199   91   29.85  13.68    9.82     3.13    3.19    3.06
paper3    CL        46526   84   30.00  14.40    9.80     3.00    3.24    3.12
paper4    CL        13286   80   30.34  14.76    9.91     2.75    3.21    3.04
paper5    CL        11954   91   30.00  14.04    9.80     2.75    3.14    2.97
paper6    CL        38105   93   30.29  12.80    9.89     3.00    3.08    2.96
alice29   CN       152089   74   30.27  14.14    9.84     3.25    3.34    3.20
lcet10    CN       426754   84   29.75  12.70    9.66     3.38    3.21    3.12
plrabn12  CN       481861   81   29.98  15.13    9.74     3.38    3.60    3.52
bible     CN      4047392   63   29.28  10.87    7.27     3.75    3.01    2.94
world192  CN      2473400   94   27.98   7.87    9.22     3.75    2.58    2.53
bib       CL       111261   81   28.53   9.94    9.46     3.13    2.76    2.68
news      CL       377109   98   29.48  12.10    9.54     3.38    3.25    3.15
progc     CL        39611   92   29.73  11.87    9.59     3.00    2.98    2.87
progl     CL        71646   87   29.96   8.71   10.22     3.13    2.48    2.40
progp     CL        49379   89   30.21   8.28   10.31     3.00    2.44    2.35
trans     CL        93695   99   30.47   6.69   10.49     3.13    2.41    2.35
fields    CN        11150   90   29.86   9.40    9.78     2.75    2.54    2.43
cp        CN        24603   86   29.04  10.44    9.34     2.88    2.75    2.64
grammar   CN         3721   76   29.96  10.60   10.14     2.50    2.48    2.36
xargs     CN         4227   74   30.02  13.10    9.63     2.63    2.90    2.75
asyoulik  CN       125179   68   29.97  14.93    9.77     3.13    3.49    3.34
geo       CL       102400  256   26.97  13.10    7.49     3.13    3.27    3.18
obj1      CL        21504  256   27.51  13.20    7.69     2.88    3.12    2.98
obj2      CL       246814  256   27.22   8.66    9.30     3.25    2.75    2.67
ptt5      CN       513216  159   27.86   8.08    8.94     3.38    1.70    1.63
kennedy   CN      1029744  256   21.18   7.29    4.64     3.50    1.65    1.57
sum       CN        38240  255   27.79  10.26    8.92     3.00    2.62    2.53


ecoli 


CN 


4638690 


4 


34.01 


23.55 


12.56 


3.75 


4.59 


4.46 
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4 Results 

The space needed for suffix array construction depends on the size of the
text and on the sorting algorithm. The text size determines the size of the pointers
into the text. We use 32-bit pointers in our experiments to keep the
implementation simple. The space occupied by the suffix array is then 4 * n bytes, and
5 * n bytes after adding the text needed for string matching. We use a standard
quicksort algorithm, which does not need extra space for sorting. Quicksort
works in O(n * n) time in the worst case, but in O(n * log(n)) time on
average. The DAWG can be created using the on-line construction algorithm in
O(n) time [3]. Vertex re-ordering can also be done in O(n) time, and the encoding of
DAWG elements as described above can also be done in O(n) time. This means
that the described DAWG construction can be performed in O(n) time. The
time of construction is similar for both structures.
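
The suffix array side of this comparison can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, Python's built-in sort stands in for the quicksort mentioned above, and the 32-bit pointer layout is not modelled. It only shows the idea of sorting suffix start positions and then locating a pattern by binary search.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Sorting by the suffix strings themselves is simple but copies each suffix,
    so this sketch is only suitable for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text, using binary search on sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "ana")
```

Each binary search performs O(log n) comparisons of at most m symbols, which is the usual reason a plain suffix array answers single-pattern queries quickly.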

The space requirements are a very important criterion for the implementation.
Table 1 shows the results for the set of test files.
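
The S.Array column of Table 1 can be related to the caption's lower bound with a short calculation. The sketch below assumes that the column is the pointer size of $\lceil \log_2(n) \rceil$ bits, expressed in bytes, plus one byte per symbol for the text itself. This reading is an inference from the published figures rather than a formula stated in the paper; it reproduces, for example, the book1, bible and xargs entries, but not every row (the ecoli entry differs). The function name is hypothetical.

```python
import math


def suffix_array_lower_bound(n):
    """Bytes per input symbol for a suffix array over a text of n symbols.

    Assumption: ceil(log2(n)) bits per pointer, plus 1 byte per symbol for the text.
    """
    pointer_bits = math.ceil(math.log2(n))
    return pointer_bits / 8 + 1


book1 = suffix_array_lower_bound(768771)   # 20-bit pointers
xargs = suffix_array_lower_bound(4227)     # 13-bit pointers
```

Under this assumption a text needs a wider pointer only when its length crosses a power of two, which is why several files of different sizes share the same S.Array value.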

The time complexity of searching in such an encoded DAWG is O(m), see
[2]. This complexity assumes that the time needed to decode the individual parts is
constant with respect to the size of the text. It also assumes that the
appropriate transition is found in constant time, whereas in our
implementation this takes time linear in the size of the alphabet.
We therefore ran some tests to compare the time of string searching.

In the experiments we use patterns that occur in the texts. The lengths of the
patterns range between 4 and 20 symbols. There were 1000 patterns prepared
for each file, and this set was searched ten times to measure the time. We used
a PC with a Pentium processor and 500 MB of RAM, running the Linux operating
system. The search time was relatively constant: about 0.5 seconds for the
suffix array and about 15 seconds for the DAWG in our implementation. The suffix array
is faster, but the set of patterns could be preprocessed into a trie automaton.
The task of searching for a set of patterns could then be solved by intersecting the
trie with the DAWG automaton. This technique is used for solving more complex
tasks, see [2].

5 Conclusion 

A new method of DAWG implementation is presented and compared to the 
implementation of the suffix array. The results show that the ratio of code file 
size to the input file size is about 3:1. 

It was shown that the suffix array is simpler to implement, faster to
construct, and faster to search. The DAWG is a more complex
structure and is more suitable for complex tasks, for example approximate string
matching.
Abstract. Extended context-free grammars are context-free grammars
in which the right-hand sides of productions are allowed to be any regular
language rather than being restricted to only finite languages. We present
a novel view on top-down predictive parser construction for extended
context-free grammars that is based on the rewriting of partial syntax
trees. This work is motivated by our development of ecfg, a Java toolkit
for the manipulation of extended context-free grammars, and by our
continuing investigation of XML.



1 Introduction 

We have been investigating XML, the Web language for encoding structured
documents, its properties and its use for a number of years. The language
XML itself, the document grammars that XML defines, and various other Web
languages are defined by extended context-free grammars; that is, context-free
grammars in which the right-hand sides of productions are allowed to be any
regular language rather than being restricted to only finite languages. Hence, we
became interested in factoring out all the grammar processing that Web applications
are based on and need to perform into a separate toolkit that we call ecfg.

A cornerstone of ecfg is the ability to generate parsers from grammars. We
present in this paper the principles of generating strong LL(1) parsers from a subset
of the extended context-free grammars that is pertinent to Web grammars.
We call these parsers eSLL(1) parsers. In particular, we contribute a novel
view of predictive parsing that is based on what we call partial syntax trees.
The parser generator is intended to be one tool among many in our toolkit.

LaLonde appears to have been the first person to seriously consider the
construction of parsers for extended context-free grammars. The construction
of LL(1)-like parsers for extended context-free grammars has been discussed by
Heckmann, by Lewi and his co-workers, and by Sippu and Soisalon-Soininen.
Warmer and his co-workers and Clark have developed
SGML parsers based on LL(1) technology. Mössenböck and Parr and
Quong [15] have implemented LL(1) parser generators for extended context-free
grammars.

We give, in Section 2, some general background information on extended
context-free grammars and Web languages and discuss the level of support that is
currently available for grammar manipulation. We then define the basic notation
and terminology that we need, in Section 3, before introducing partial syntax
trees, in Section 4. In Section 5, we describe a nondeterministic algorithm eNSLL
to compute a sequence of leftmost partial syntax trees for a given input string.
We define an extended context-free grammar to be an eSLL(1) grammar if
and only if the algorithm eNSLL is actually deterministic and we characterize
eSLL(1) grammars in terms of first and follow sets. We mention further results
in the final section.

2 The Grammar Toolkit 

Starting with XML, we present in this section some general background infor- 
mation on extended context-free grammars and Web languages and discuss the 
level of support that is currently available for grammar manipulation. We focus 
on the facts that have led to our decision to develop the grammar toolkit ecfg 
and to equip it with a predictive-parser generator. 

XML is defined by an extended context-free grammar. The XML grammar 
derives XML documents that consist of an optional Document Type Defini- 
tion (DTD) and the document proper, called the document instance. The XML 
grammar describes the syntax for DTDs and instances in general terms. The 
DTD is specific to an application domain and not only defines the vocabulary 
of elements, attributes and references in the document but also specifies how 
these constructs may be combined. The DTD is again an extended context-free 
grammar. 

There are a number of XML parsers that read a DTD and a document 
instance, and are able to determine whether both follow the general rules of 
the XML grammar and whether the instance conforms to the DTD. Further- 
more, there are two well-established means for application programs to access 
XML data; namely DOM, a W3C standard, and SAX, an industry standard. 
XML parsers typically support both of these Application Programming Inter- 
faces (APIs). 

It is curious that none of the XML tools we are aware of provide API access 
to the DTD of an XML document. This limits and hinders the development of 
XML tools that are customized for the application domain and of tools that read 
and manipulate DTDs such as DTD-aware editors for document instances, DTD- 
browsers, DTD-analyzers and DTD-aware query optimizers for XML documents. 
State-of-the-art XML tools treat DTDs as black boxes; they may be able to 
handle DTDs but they do not share their knowledge! 

For this reason we propose a more transparent approach to the development 
of XML tools that is based on ecfg, a Java toolkit for the manipulation of 
extended context-free grammars, that we are currently implementing. 
Grammars are ubiquitous. In the Web context alone, we have not only the
XML grammar and XML DTDs but also XML Schemas, the CSS grammar, the 
XPath grammar, and specific DTDs or quasi-DTDs such as MathML and XSLT. 
Web tools such as a CSS processor, an XPath query optimizer and an XML 
processor that validates a document instance against its DTD are very different 
applications. With ecfg we aspire to support any grammar manipulations that 
these diverse tools might need to perform. 

The cornerstone of grammar manipulation is to generate parsers automatically
from grammars. The theory of parser generators is well understood and
has been explored, at least for the case of nonextended grammars, in a number
of textbooks. Furthermore, this knowledge is embodied in generator
tools such as lex, flex, bison, yacc, Coco/R and ANTLR. A parser generator
constructs a parser from a given grammar. The parser in turn converts a symbol 
stream into a syntax tree or some other structure that reflects the phrase struc- 
ture of the symbol stream; alternatively and more commonly, the parser does 
not expose the parse tree itself but triggers semantic actions that are coded into 
its grammar for each phrase that it recognizes in the symbol stream. 

In the context of Web languages, the languages that a grammar derives are 
often grammars once more, as exemplified by the XML grammar that derives 
DTDs; that is, document grammars. These grammars not only need to be turned 
into parsers again, which would be in the scope of standard parser generators, 
but they also need to be analyzed and transformed in complex ways. The analysis 
requires, in some cases, computations that parser generators perform but do not 
expose; for example, computations of first and follow sets. For this reason, we 
have decided to design and implement our own parser generator in ecfg. In our 
domain of applications, grammars need to be first-class citizens. 

It is a common assumption that XML documents are easy to parse;
although this assumption has not been formally verified, it is obviously true for
document instances, which are fully bracketed. Hence, of all the alternative parsing
strategies that are in use, our parser generator employs the simplest one, namely
the strong LL approach to parsing with a one-symbol lookahead, generalizing it
from “normal” to extended context-free grammars.



3 Notation and Terminology 

An extended context-free grammar G is specified by a tuple of the form
$(N, \Sigma, P, S)$, where N is a nonterminal alphabet, $\Sigma$ is a terminal alphabet, P
is a set of production schemas of the form $A \to L_A$, such that A is a
nonterminal and $L_A$ is a regular language over the alphabet $\Sigma \cup N$, and S, the
sentence symbol, is a nonterminal. Given a production schema $A \to L_A$ such
that $\alpha$ is a string in $L_A$, we say that $A \to \alpha$ is a production of G. We call
$L_A$ the rhs-language of A. Since the rhs-languages of extended grammars are
regular, extended and “normal” grammars derive the same class of languages,
namely the context-free languages.
We represent the set of production schemas P of an extended context-free
grammar $G = (N, \Sigma, P, S)$ as a transition diagram system [7,8], which provides
a nondeterministic finite automaton (NFA) for the rhs-language of each nonterminal.
We require the automata to be of Glushkov type so that each state is
labeled by a symbol in $\Sigma \cup N$ and each incoming transition bears that state’s
label.

In practice, the rhs-languages of an extended grammar are given as regular
expressions. A regular expression can, however, be transformed into a Glushkov-type
NFA in linear time.

A transition diagram system DS = (Q, label, F, init, trans, belongs)
over N and $\Sigma$ has a set of states Q, a relation label $\subseteq Q \times (N \cup \Sigma)$ that
maps each state to at most one symbol, a set of final states $F \subseteq Q$, a relation
init $\subseteq N \times Q$ that assigns exactly one initial state to each nonterminal, a relation
trans $\subseteq Q \times Q$ for the transitions between states and a relation belongs $\subseteq Q \times N$
that maps each state to the unique nonterminal to whose rhs-automaton the state
belongs.

A relationship p trans q implies that q has some label X (that is, q label X)
and means that there is a transition from p to q on the symbol X. This
notion of transition relation accounts for the Glushkov property. A production
$A \to \alpha$ of the grammar G translates into a string $\alpha$ in the language of the NFA
$M_A = (N \cup \Sigma, Q, p_A, F, \mathrm{trans})$ such that A init $p_A$. Each state in the
transition diagram system is uniquely assigned to some nonterminal via the relation
belongs. A state that is reachable from a nonterminal’s initial state must belong
to that same nonterminal. When we construct the automata from the rhs-expressions
of a grammar, the sets of states must hence be chosen to be disjoint.

For the remainder of the paper, let $\Sigma$ denote a set of terminals and N
denote a set of nonterminals; their union $\Sigma \cup N$ forms a vocabulary. An
extended context-free grammar is given as $G = (N, \Sigma, P, S)$; its set of
production schemas P is represented by the transition diagram system DS =
(Q, label, F, init, trans, belongs) over N and $\Sigma$.

Names A and B denote nonterminals, a and b denote terminals, X denotes a
symbol in the vocabulary $\Sigma \cup N$, p denotes a state, $\alpha$, $\beta$ and $\gamma$ denote strings over
the vocabulary, whereas u, v and w denote strings over $\Sigma$. If we need additional
names of any of these types, we use embellishments.

4 Partial Syntax Trees 

We base our approach to parsing on the use of partial syntax trees whose nodes
are labeled with symbols in the vocabulary $\Sigma \cup N$ and some of whose nodes are
annotated with states in Q. Internal nodes are always labeled with nonterminal
symbols from N; external nodes are labeled with symbols from either alphabet.
Only nodes that have a nonterminal label may have an annotation. We call
annotated nodes active.

A partial syntax tree represents the progress a parser has made in construct- 
ing a syntax tree for a given input string of terminals in a top-down manner. An 
active node represents a construction site where the parser may append further
nodes to the list of child nodes, thus gradually expanding the active node. When 
the parser has completed the expansion work at a construction site, it will re- 
move the annotation to make the node inactive. The goal is to construct a partial 
syntax tree without any active nodes such that its terminal-labeled leaves spell 
out the input string that is to be parsed. Leaves that are inactive and labeled 
with a nonterminal are not expanded and thus contribute the empty string to 
the input string. 

A grammar, particularly its transition diagram system, constrains the work 
of a parser and determines if the partial syntax tree that is constructed con- 
forms to the grammar: First of all, the tree’s root must be labeled with the 
grammar’s sentence symbol. Furthermore, the labels and annotations in the tree 
must conform to the grammar in the following way: 

— For each inactive node v, the labels of the children of v must spell out a 
string in the rhs-language of v’s label. 

— For each active node v, its state annotation is reachable from the node label’s 
initial state by the input string formed by the sequence of labels of v’s 
children. 

Particularly, each external active node must be annotated with the initial state 
of the node’s label and the language of an inactive external node’s label must 
contain the empty string. 

Since we wish to explore top-down left-to-right parsing, we are particularly 
interested in leftmost partial syntax trees in which the active nodes form a 
prefix of the rightmost branch of the tree. A leftmost construction is the analog 
of a leftmost derivation for “normal” context-free grammars. The frontier of a 
leftmost partial syntax tree is a sequence, from left to right, of inactive nodes, 
followed by at most one active node. The sequence of nodes in the frontier that 
have terminal labels yields a terminal string over $\Sigma$, which we call the yield of
the leftmost partial syntax tree. 

We call a partial syntax tree on which all work has been completed (that is, 
which has no active nodes left) just a syntax tree. Whereas a partial syntax 
tree represents a parser’s work-in-progress, a syntax tree represents the finished 
product that may then be exposed to application programs. 

A partial syntax tree of grammar G can be constructed incrementally by
beginning with an initial one-node partial syntax tree and then, step by step,
adding nodes, changing the states that are associated with nodes and making
active nodes inactive. Rather than applying a whole production $A \to \alpha$,
$\alpha = X_1 \cdots X_n$, in a single step to a node v with label A, we add n children to v one
after the other, in n steps, labeling the children with $X_1, \ldots, X_n$ and keeping
track in the state annotation of v how far we have progressed. We can view this
process as a sequence of transformations on partial syntax trees.

A single transformation manipulates, nondeterministically, an active node v
of a partial syntax tree in one of the following three ways, where we assume that
v is labeled with a nonterminal A and is annotated with a state p that belongs
to A:
1. If p is a final state, then v is made inactive by removing the state p from v.
We call this transformation a reduce step.

2. If there is a transition from p to p' and if the label of state p' is a terminal a,
then a new node v' is added as a new rightmost child of v. The new node is
labeled with a and v is annotated with state p'. We call this transformation
a shift step.

3. If there is a transition from p to p' and if the label of state p' is a
nonterminal B, then a new active node v' is added as a new rightmost child
of v. The new node is labeled with B and is annotated with the initial state
that is associated with B. Furthermore, v is annotated with state p'. We
call this transformation an expand step.
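
The three transformations can be sketched in Python as operations on a small tree data structure. Everything concrete below is an assumption made for illustration: the toy grammar S → a B*, B → b, the state names (s0, s1, ...), the dictionary encoding of the transition diagram system, and the function names. The sketch only mirrors the definitions above; it is not the ecfg implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical toy grammar  S -> a B*,  B -> b  as a transition diagram system.
LABEL = {"s1": "a", "s2": "B", "b1": "b"}    # label: state -> symbol (initial states carry no label)
FINAL = {"s1", "s2", "b1"}                   # F
INIT = {"S": "s0", "B": "b0"}                # init: nonterminal -> initial state
TRANS = {("s0", "s1"), ("s1", "s2"), ("s2", "s2"), ("b0", "b1")}
NONTERMINALS = set(INIT)


@dataclass
class Node:
    label: str
    state: Optional[str] = None              # annotation; None means the node is inactive
    children: list = field(default_factory=list)


def reduce_step(v):
    """Make v inactive; only allowed if its annotation is a final state."""
    assert v.state in FINAL
    v.state = None


def shift_step(v, p_next):
    """Add a terminal-labeled rightmost child and move v's annotation to p_next."""
    assert (v.state, p_next) in TRANS and LABEL[p_next] not in NONTERMINALS
    v.children.append(Node(LABEL[p_next]))
    v.state = p_next


def expand_step(v, p_next):
    """Add an active nonterminal-labeled rightmost child and move v's annotation to p_next."""
    assert (v.state, p_next) in TRANS and LABEL[p_next] in NONTERMINALS
    child = Node(LABEL[p_next], INIT[LABEL[p_next]])
    v.children.append(child)
    v.state = p_next
    return child


def tree_yield(v):
    """Concatenate the terminal labels of the leaves from left to right."""
    if not v.children:
        return "" if v.label in NONTERMINALS else v.label
    return "".join(tree_yield(c) for c in v.children)


# A leftmost construction for the input string "ab".
root = Node("S", INIT["S"])
shift_step(root, "s1")          # S gets child a
b = expand_step(root, "s2")     # S gets active child B
shift_step(b, "b1")             # B gets child b
reduce_step(b)
reduce_step(root)
```

After the last reduce step no node carries an annotation, so the result is a syntax tree in the sense defined above.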

A construction of a partial syntax tree is a sequence of transformations that 
begins with the initial partial syntax tree whose one node is active, is labeled 
with the sentence symbol S and is annotated with the initial state of S and that 
ends with the partial syntax tree itself. 

A leftmost construction is a construction that begins with the initial par- 
tial syntax tree and consists only of leftmost partial syntax trees. (Note that the 
initial partial syntax tree and all syntax trees are leftmost.) 

Theorem 1. The language of a grammar consists of the yields of the grammar- 
conformant syntax trees. Furthermore, for each such syntax tree there is a left- 
most construction. 



5 Predictive Parsing 

This section focuses on parsing; that is, for each terminal string u of a grammar,
we construct a syntax tree whose yield is u. Our approach to parsing is top-down
with a look-ahead of one symbol; that is, we do leftmost constructions
and read the input string from left to right, advancing at most one position at
each transformation step. At each step, the choice of transformation is guided
by the current input symbol.

A leftmost partial syntax tree is compatible with a given terminal string if 
its yield is a prefix of the string. 

As our parsing strategy, we present a nondeterministic algorithm eNSLL
to compute a leftmost construction in which each leftmost partial syntax tree
is compatible with a given input string $a_1 \cdots a_n$ of terminals. The algorithm
generalizes strong LL(1) parsing for extended grammars, which accounts for its
name.

When given a leftmost partial syntax tree that is compatible with the input
string, the algorithm expands the tree’s deepest active node using one of the
three transformation types given in Section 4. When choosing a transformation
type, the algorithm eNSLL is guided by the input symbol immediately to the
right of the partial syntax tree’s yield. We call this symbol the current input
symbol. If the algorithm has moved beyond the end of the input string, we set
the current input symbol to the empty string, thus signaling to the algorithm
that it has read the full input string.

We need to introduce a number of concepts to describe the behaviour of 
eNSLL: 

First, for each pair of states p and p' such that p trans p', let L(p,p') be the
language of all strings over $\Sigma$ that are the yields of syntax trees constructed
from the one-node initial tree whose label is the symbol that is associated with p
and whose state annotation is p such that the first transformation is to add a
child to the root that is labeled with the symbol of p' and to annotate the root
with state p'. Formally, L(p,p') is the union of all languages $L(X\alpha)$ such that
p' label X and $\alpha$ in $(N \cup \Sigma)^*$ moves the grammar’s transition diagram system
from p' to some final state.

Next, follow(A) is the set of all terminals that can occur immediately after A
in a string that G can derive from S. More formally, follow(A) consists of all
a in $\Sigma$ such that G derives $\alpha A a \beta$ from S, for some $\alpha$ and $\beta$ in $(N \cup \Sigma)^*$. In
addition, we add the empty string to follow(A) for each A for which G derives
some $\alpha A$.

Finally, we consider every leftmost partial syntax tree whose deepest active
node v is annotated with some state p and labeled with the symbol of p, and
any construction whose first transformation step adds a new rightmost child to
v and that annotates v with some state p' such that p trans p' while labeling
the new node with the symbol of p'. The first terminals that are derived by
such constructions form the set first(p,p'). To be precise, first(p,p') consists of
the first symbols of the strings in L(p,p') follow(A), where A is the label of v;
note that we consider the empty string to be the first symbol of the empty
string. Furthermore, first(p) is the union of all first(p,p') such that p trans p'.

The algorithm eNSLL computes a leftmost construction for a syntax tree
whose yield is a given input string of terminals. The algorithm starts with the 
initial one-node partial syntax tree that each construction starts with. At each 
step of the construction, it chooses nondeterministically a reduce, shift, or expand 
step to continue its computation, but the choice is constrained by the next input 
symbol and by the first and follow sets. 

In order to be more precise, let us look at the following scenario: We are
given a leftmost partial syntax tree t that is compatible with some terminal
string. Furthermore, we assume that we can continue any leftmost construction
of t until we reach a syntax tree for the terminal string. Let the tree’s deepest
active node v have label A and annotation p. Finally, let a be the current input
symbol. In this situation, eNSLL performs a reduce step only if a is in follow(A),
it performs a shift step to a state p' only if p' label a, and it performs an expand
step to a state p' only if a is in first(p,p').

An eNSLL computation terminates if no further transformation steps are 
possible. The eNSLL algorithm accepts an input string if it terminates with a 
partial syntax tree that is, in fact, a syntax tree and whose yield is the complete 
input string. 
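
The way the lookahead constrains the choice of step can be sketched in Python for a small example. The toy grammar S → a B*, B → b, its state names, and the hand-computed first and follow tables are assumptions made for this illustration (the empty string "" marks the end of the input). The stack holds the active nodes of the rightmost branch as (label, state) pairs; the tree itself is not built.

```python
# Hypothetical toy grammar  S -> a B*,  B -> b  as a transition diagram system.
LABEL = {"s1": "a", "s2": "B", "b1": "b"}
FINAL = {"s1", "s2", "b1"}
INIT = {"S": "s0", "B": "b0"}
TRANS = {"s0": ["s1"], "s1": ["s2"], "s2": ["s2"], "b0": ["b1"], "b1": []}

# Hand-computed for this grammar; "" stands for the end of the input.
FOLLOW = {"S": {""}, "B": {"b", ""}}
FIRST = {("s0", "s1"): {"a"}, ("s1", "s2"): {"b"},
         ("s2", "s2"): {"b"}, ("b0", "b1"): {"b"}}


def allowed_steps(A, p, a):
    """Return every step permitted at an active node labeled A in state p on lookahead a."""
    steps = []
    if p in FINAL and a in FOLLOW[A]:
        steps.append(("reduce", None))
    for q in TRANS[p]:
        if LABEL[q] in INIT:                  # q is labeled with a nonterminal
            if a in FIRST[(p, q)]:
                steps.append(("expand", q))
        elif LABEL[q] == a:                   # q is labeled with the terminal a
            steps.append(("shift", q))
    return steps


def accepts(word):
    """Run the step selection deterministically; fail if no step or several steps apply."""
    stack = [("S", INIT["S"])]
    pos = 0
    while stack:
        A, p = stack[-1]
        a = word[pos] if pos < len(word) else ""
        steps = allowed_steps(A, p, a)
        if len(steps) != 1:
            return False
        kind, q = steps[0]
        if kind == "reduce":
            stack.pop()
        elif kind == "shift":
            stack[-1] = (A, q)
            pos += 1
        else:  # expand
            stack[-1] = (A, q)
            stack.append((LABEL[q], INIT[LABEL[q]]))
    return pos == len(word)
```

For this grammar `allowed_steps` never returns more than one step, which is the situation in which the nondeterministic algorithm behaves deterministically.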
Theorem 2. The eNSLL algorithm of a grammar accepts exactly the strings of
the grammar.

We say that a grammar is eSLL(1) if and only if the eNSLL algorithm for
the grammar is deterministic.

Theorem 3. An extended context-free grammar is eSLL(1) if and only if it
satisfies the following two conditions:

1. For each final state p such that p belongs A, the sets first(p) and follow(A)
are disjoint.

2. For each pair of different states q, q' to which there are transitions from p,
the sets first(q) and first(q') are disjoint.



Theorem 4. We can test if a grammar is eSLL(1) in worst-case time $O(|\Sigma| \cdot |G|)$.



Theorem 5. Let the rhs-languages of an extended context-free grammar be defined
by regular expressions and let the grammar’s transition diagram system be
computed with the Glushkov construction. If the grammar is eSLL(1), then
the transition diagram system must be deterministic.



6 Further Results 

In addition to providing the proofs for the theorems in this paper, the full version 
of this paper investigates two related topics: 

First, it is straightforward to build a parsing table of parse actions from the
first and follow sets that drive the eNSLL algorithm. There is at most one parse
action in each table cell if and only if the grammar is eSLL(1). We can build the
parse table in worst-case time $O(|\Sigma| \cdot |G|)$.

Second, “normal” context-free grammars are a special case of extended grammars.
This carries over to strong LL(1) grammars. In the full version of this paper,
we characterize SLL(1) grammars in terms of eSLL(1) extended grammars.

We discuss implementation and application issues in a separate paper. 
Abstract. In this paper, we extend the characterisation of Glushkov automata
to multiplicities. We consider automata obtained from rational
expressions in star normal form. We show that for this class of automata,
the graphical Boolean properties are preserved. We prove that this new
characterization depends only on conditions on coefficients, and we make
these conditions explicit.



1 Introduction 

The extension of Boolean algorithms (over languages) to multiplicities (over
series) has always been a central point in theoretical research. First,
Schützenberger [14] gave an equivalence between rational and recognizable
series, extending the classical Kleene theorem. Recent work has been done
in this area. The authors have extended the Glushkov construction to automata
with multiplicities [6]. Lombardy and Sakarovitch have given an extension
of Antimirov’s algorithm based on partial derivatives [2]. The SEA software [1]
is suited to this kind of problem, allowing us to work on both Boolean and
multiplicity automata.

Caron and Ziadi have provided a characterisation of Boolean Glushkov automata
[7]. After recalling some theoretical background, we extend this result to
multiplicities in the case of rational expressions in star normal form.

2 Definitions 

2.1 Classical Notions 

Let $\Sigma$ be a finite alphabet, $\varepsilon$ the empty word, and $(\mathbb{K}, \oplus, \otimes)$ be a semiring
where 0 is the neutral element of $(\mathbb{K}, \oplus)$ and 1 the one of $(\mathbb{K}, \otimes)$. A formal series
[3] is a mapping S from $\Sigma^*$ into $\mathbb{K}$, usually denoted by $S = \sum_{w \in \Sigma^*} (S|w)\, w$ (where
$(S|w) := S(w) \in \mathbb{K}$ is the coefficient of w in S). Rational expressions are obtained
from the letters by a finite number of combinations of rational laws ($+$, $\cdot$, $^*$ and
an external product $\times$). Rational series are formal series that can be described
by rational expressions. When $\mathbb{K} = \mathbb{B}$, the external product can be omitted and
then we talk about regular languages and regular expressions.

A Boolean automaton $\mathcal{M}$ over an alphabet $\Sigma$ is usually defined as a
5-tuple $(\Sigma, Q, I, F, \delta)$ where Q is a finite set of states, $I \subseteq Q$ the set of initial states,
$F \subseteq Q$ the set of final states, and $\delta \subseteq Q \times \Sigma \times Q$ the set of edges. We denote
by L(E) the language represented by the regular expression E and by $L(\mathcal{M})$
the language recognized by the automaton $\mathcal{M}$. A weighted finite automaton
(WFA) over an alphabet $\Sigma$ is then a 5-tuple $(\Sigma, Q, I, F, \delta)$ on a semiring
$\mathbb{K}$, and the sets I, F and $\delta$ are rather viewed as mappings $I : Q \to \mathbb{K}$,
$F : Q \to \mathbb{K}$, and $\delta : Q \times \Sigma \times Q \to \mathbb{K}$. In the remainder of this paper, we will need
the original construction of Glushkov, which is summarized as follows.
The first step is to mark out each occurrence of the same symbol in a rational
expression E. Therefore, each occurrence of a letter is indexed by its position
in the expression. The resulting expression is denoted $\overline{E}$, defined over the alphabet of
indexed symbols $\overline{\Sigma}$, each one appearing at most once in $\overline{E}$. Glushkov defines four
functions on $\overline{E}$ in order to compute a not necessarily deterministic automaton.
$First(\overline{E})$ represents the set of initial positions of words of $L(\overline{E})$, $Last(\overline{E})$ the
set of final positions of words of $L(\overline{E})$, and $Follow(\overline{E}, i)$ the set of positions which
immediately follow the position i in the expression $\overline{E}$. $Null(\overline{E})$ returns $\{\varepsilon\}$
if the language $L(\overline{E})$ contains the empty word, $\emptyset$ otherwise. These functions
allow us to define the automaton $\overline{\mathcal{M}} = (\overline{\Sigma}, Q, 0, F, \overline{\delta})$ where

1. $\overline{\Sigma}$ is the indexed alphabet,

2. 0 is the single initial state with no incoming edge,

3. $Q = Pos(\overline{E}) \cup \{0\}$,

4. $\forall i \in First(\overline{E})$, $\overline{\delta}(0, a_i) = \{i\}$, $a_i \in \overline{\Sigma}$,

5. $\forall i \in Pos(\overline{E})$, $\forall j \in Follow(\overline{E}, i)$, $\overline{\delta}(i, a_j) = \{j\}$, $a_j \in \overline{\Sigma}$,

6. $F = Last(\overline{E}) \cup Null(\overline{E}) \cdot 0$.

The Glushkov automaton $\mathcal{M} = (\Sigma, Q, 0, F, \delta)$ of E is computed from $\overline{\mathcal{M}}$ by
replacing the indexed letters on edges by the corresponding letters in the
expression E.
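
The Boolean construction just summarized can be sketched in Python. The tuple encoding of expressions, the function names, and the example expression (a+b)*·a are assumptions made for this illustration; the sketch covers only the Boolean case, not the extension to multiplicities. Positions are numbered 1, 2, ... from left to right, and the edges are returned already relabeled with the original letters.

```python
# Expressions as nested tuples (a hypothetical encoding):
#   ("eps",)  ("sym", a)  ("alt", e, f)  ("cat", e, f)  ("star", e)

def glushkov(expr):
    """Return (states, edges, final) of the Glushkov automaton; the initial state is 0."""
    letters = {}    # position -> letter
    follow = {}     # position -> set of positions that may follow it

    def visit(e):
        """Return (null, first, last) for e and fill in follow as a side effect."""
        kind = e[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            i = len(letters) + 1
            letters[i] = e[1]
            follow[i] = set()
            return False, {i}, {i}
        if kind == "alt":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            for i in l1:                      # last of e may be followed by first of f
                follow[i] |= f2
            first = (f1 | f2) if n1 else f1
            last = (l1 | l2) if n2 else l2
            return n1 and n2, first, last
        if kind == "star":
            _, f, l = visit(e[1])
            for i in l:                       # loop back from last to first
                follow[i] |= f
            return True, f, l
        raise ValueError("unknown expression kind: %r" % (kind,))

    null, first, last = visit(expr)
    states = {0} | set(letters)
    edges = {(0, letters[i], i) for i in first}
    edges |= {(i, letters[j], j) for i in follow for j in follow[i]}
    final = set(last) | ({0} if null else set())
    return states, edges, final


def accepts(automaton, word):
    """Simulate the (possibly nondeterministic) automaton on word."""
    _, edges, final = automaton
    current = {0}
    for a in word:
        current = {q for (p, b, q) in edges if p in current and b == a}
    return bool(current & final)


# Example: (a+b)* . a  has positions a1, b2, a3.
example = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
automaton = glushkov(example)
```

Because every edge into a state i carries the letter of position i, the resulting automaton is homogeneous by construction.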

We will also need the Glushkov construction defined by the authors in the
case of multiplicities. This construction is obtained from the previous one by
associating a coefficient to each element of every set.



2.2 Extended Notions 

We have to define the casting $\widetilde{\mathcal{M}}$ (resp. $\widetilde{E}$) of a WFA $\mathcal{M}$ (resp. a rational
expression E) in $\mathbb{B}$. Similarly, Buchsbaum et al. define the topology of a graph.
The automaton $\widetilde{\mathcal{M}} = (\Sigma, Q, \widetilde{I}, \widetilde{F}, \widetilde{\delta})$ is then defined with $\widetilde{I}, \widetilde{F} \subseteq Q$ and
$\widetilde{I} = \{q \in Q \mid I(q) \neq 0\}$, $\widetilde{F} = \{q \in Q \mid F(q) \neq 0\}$, and
$\widetilde{\delta} = \{(p, a, q) \mid p, q \in Q,\ a \in \Sigma \text{ and } \delta((p, a, q)) \neq 0\}$.
The regular expression $\widetilde{E}$ is obtained from E without
respect to the weights. Brüggemann-Klein defines expressions in star normal
form (SNF) [4] as expressions for which all unions of First are disjoint. Formally,
for every subexpression $H^*$ in E, we have $\forall x \in Last(H)$, $Follow(H, x) \cap First(H) = \emptyset$.
We extend this definition to multiplicities. A rational expression E is in SNF
if $\widetilde{E}$ is in SNF.

The computation of the Glushkov (resp. extended Glushkov) automaton from
a regular (resp. rational) expression E is denoted by $M_{\mathbb{B}}(E)$ (resp. $M_{\mathbb{K}}(E)$).

Lemma 1. Let E be a rational expression. If E is in SNF, then

$\widetilde{M_{\mathbb{K}}(E)} = M_{\mathbb{B}}(\widetilde{E})$.



3 Characterization of Glushkov Boolean Automata 

An automaton $\mathcal{M} = (\Sigma, Q, I, F, \delta)$ is homogeneous if for all $(p, a, q), (p', a', q') \in \delta$,
$q = q' \Rightarrow a = a'$. As the Glushkov automaton $\mathcal{M} = (\Sigma, Q, 0, F, \delta)$ of an
expression $E$ is homogeneous, we can define the Glushkov graph as $G = (X, U)$
with $X$ the set of vertices ($Pos(E) \cup \{0\} \cup \{\Phi\}$), and $U$ the set of edges (edges
of $\mathcal{M}$ without label and edges from final states of $\mathcal{M}$ to $\Phi$). A hammock is a
graph $G = (X, U)$ without loop if $|X| = 1$; otherwise it has two distinguished
vertices $i$ and $t$ such that, for any vertex $x$ of $X$, (1) there exists a path from $i$
to $t$ going through $x$, (2) there is no path from $t$ to $x$ nor from $x$ to $i$. $O \subseteq X$
is a maximal orbit of $G$ if and only if it is a strongly connected component
with at least one edge. The set of direct successors (resp. direct predecessors) of
$x \in X$ is denoted by $Q^+(x)$ (resp. $Q^-(x)$). For an orbit $O \subseteq X$, $O^+(x)$ denotes
$Q^+(x) \cap (X \setminus O)$ and $O^-(x)$ denotes the set $Q^-(x) \cap (X \setminus O)$. In other words,
$O^+(x)$ is the set of vertices which are directly reached from $x$ and which are
not in $O$. $In(O) = \{x \in O \mid O^-(x) \neq \emptyset\}$ and $Out(O) = \{x \in O \mid O^+(x) \neq \emptyset\}$
denote the input and the output of the orbit $O$. A maximal orbit $O$ is stable if
$Out(O) \times In(O) \subseteq U$. A maximal orbit $O$ is transverse if for all $x, y \in Out(O)$,
$O^+(x) = O^+(y)$ and for all $x, y \in In(O)$, $O^-(x) = O^-(y)$.

A maximal orbit $O$ is strongly stable (resp. transverse) if it is stable (resp.
transverse) and if after deleting the edges in $Out(O) \times In(O)$ every maximal
suborbit is strongly stable (resp. transverse).

Let $G$ be a graph in which all the orbits are strongly stable. We call graph
without orbit of $G$ the acyclic graph obtained by recursively deleting, for every
maximal orbit $O$ of $G$, the edges in $Out(O) \times In(O)$.
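
As an illustration of these definitions, the following Python sketch computes the maximal orbits of a small graph and checks (non-recursive) stability and transversality of one orbit. The graph encoding (a dictionary mapping each vertex to its set of direct successors) and the function names are assumptions made for this sketch; the reachability-based orbit computation is chosen for brevity, not efficiency.

```python
from itertools import product

def reachable(succ, x):
    """Vertices reachable from x by a path of at least one edge."""
    seen, stack = set(), list(succ.get(x, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(succ.get(v, ()))
    return seen

def maximal_orbits(succ):
    """Strongly connected components that contain at least one edge."""
    vertices = set(succ) | {v for vs in succ.values() for v in vs}
    reach = {x: reachable(succ, x) for x in vertices}
    orbits, done = [], set()
    for x in vertices:
        if x in done or x not in reach[x]:
            continue  # x already classified, or x lies on no cycle
        orbit = {y for y in reach[x] if x in reach[y]}
        orbits.append(orbit)
        done |= orbit
    return orbits

def orbit_profile(succ, orbit):
    """Return (In(O), Out(O), is_stable, is_transverse) for the orbit O."""
    pred = {}
    for p, qs in succ.items():
        for q in qs:
            pred.setdefault(q, set()).add(p)
    o_plus = {x: set(succ.get(x, ())) - orbit for x in orbit}
    o_minus = {x: pred.get(x, set()) - orbit for x in orbit}
    ins = {x for x in orbit if o_minus[x]}
    outs = {x for x in orbit if o_plus[x]}
    # Stable: every pair of Out(O) x In(O) is an edge of the graph.
    stable = all(e in succ.get(s, ()) for s, e in product(outs, ins))
    # Transverse: all outputs share O+, all inputs share O-.
    transverse = (len({frozenset(o_plus[x]) for x in outs}) <= 1
                  and len({frozenset(o_minus[x]) for x in ins}) <= 1)
    return ins, outs, stable, transverse
```

The strong versions of both properties would repeat this check after deleting the edges of $Out(O) \times In(O)$ and recomputing the suborbits.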

A graph $G$ is reducible if it has no orbit and if it can be reduced to one vertex
by iterated applications of any of the three rules $R_1$, $R_2$, $R_3$ described below.

Rule $R_1$: If $x$ and $y$ are vertices such that $Q^-(y) = \{x\}$ and $Q^+(x) = \{y\}$, then
delete $y$ and define $Q^+(x) := Q^+(y)$.

Rule $R_2$: If $x$ and $y$ are vertices such that $Q^-(x) = Q^-(y)$ and $Q^+(x) = Q^+(y)$,
then delete $y$ and any edge connected to $y$.

Rule $R_3$: If $x$ is a vertex such that for all $y \in Q^-(x)$, $Q^+(x) \subseteq Q^+(y)$, two
cases have to be distinguished:

If $0 \notin Q^-(x)$ or $\Phi \notin Q^+(x)$ or $|X| = 3$ then delete edges in $Q^-(x) \times Q^+(x)$.

If $0 \in Q^-(x)$, $\Phi \in Q^+(x)$ and $|Q^-(x) \times Q^+(x)| \neq 1$ then delete edges in
$Q^-(x) \times Q^+(x) \setminus \{(0, \Phi)\}$.




Theorem 1 ([7]). $G = (X, U)$ is a Glushkov graph if and only if the three
following conditions are satisfied:

— G is a hammock. 

— Each maximal orbit in G is strongly stable and strongly transverse. 

— The graph without orbit of G is reducible. 



4 Glushkov WFA Properties for Star Normal Form 
Rational Expressions 

From now on, we restrict $\mathbb{K}$ to a field in order to compute our characterization.
Let us consider $\mathcal{M}$, a WFA without orbit. Our aim here is to give conditions
on weights in order to check whether $\mathcal{M}$ is a Glushkov WFA. Relying on the
Boolean characterization, we can deduce that (1) $\mathcal{M}$ is homogeneous and (2)
the Glushkov graph of $\mathcal{M}$ is reducible.

We now define a $\mathbb{K}$-graph as a graph labeled with coefficients in $\mathbb{K}$, that is
$G_{\mathbb{K}} = (X, U)$ where $X$ is the set of vertices and $U : X \times X \rightarrow \mathbb{K}$ is the label
associated with each edge. When there is no edge between two vertices $p$ and $q$,
we have $U(p, q) = 0$. In case an input value is associated to the initial state, it is
distributed on the edges linking its successors. A left multiplication by the input
value of coefficients of successors is applied. A Glushkov $\mathbb{K}$-graph is a $\mathbb{K}$-graph
and a Glushkov graph in which output costs of final states label edges to $\Phi$. We
now give the conditions on coefficients for applying rules, and how to obtain the
new label on the edge $(0, \Phi)$ when it exists.

Proposition 1. For a regular expression $E$, each edge of its Glushkov graph $G_E$
is computed only once, except for edges induced by a star and the edge between
the states $0$ and $\Phi$.



Definition 1. A $\mathbb{K}$-graph is $\mathbb{K}$-reducible if it has no orbit and if it can be reduced
to one vertex by iterated applications of the three rules $R_1$, $R_2$ and $R_3$ where the
following conditions are checked.

— If there exist two states $x$ and $y$ such that we can apply the rule $R_2$, then
$\exists k \in \mathbb{K} \mid \forall p \in Q^-(x), \forall q \in Q^+(x)$,   $= k$.

— If there exists a state $x$ satisfying the rule $R_3$ conditions, then
$\exists k \in \mathbb{K} \mid \forall p \in Q^-(x), \forall q \in Q^+(x)$,   $= k$, and if $0 \in Q^-(x)$ and $\Phi \in Q^+(x)$
then $U(0, \Phi) := U(0, \Phi) - 1/k$.



Proposition 2. If $G = (X, U)$ is a Glushkov $\mathbb{K}$-graph without orbit then it is
$\mathbb{K}$-reducible.

We can notice that when an expression is in SNF, Proposition 1 extends
the single computation of edges to the star operation.

We will now consider a graph which has at least one maximal orbit. We
extend the notions of strong stability and strong transversality to $\mathbb{K}$-graphs
obtained from rational expressions in SNF. We have to give a characterization on
coefficients only. The stability and transversality notions for WFAs are closely linked.

Definition 2. Let $G_{\mathbb{K}}$ be a Glushkov $\mathbb{K}$-graph. Let $O$ be a maximal orbit. Let $s$ be
a state of $Out(O)$ and $e$ a state of $In(O)$. Let $(p_1, \ldots, p_m)$ (resp. $(q_1, \ldots, q_n)$) be
the ordered states of $O^-(e)$ (resp. $O^+(s)$). Let $(c_1, \ldots, c_n)$ (resp. $(c'_1, \ldots, c'_m)$) be
the set of coefficients from the state $s$ (resp. to the state $e$). $G_{\mathbb{K}}$ is $\mathbb{K}$-transverse if
(1) $G_{\mathbb{K}}$ is strongly transverse and (2) $\forall s_i \in Out(O)$ (resp. $\forall e_j \in In(O)$), its set
of successors' (resp. predecessors') coefficients is $k_i (c_1, \ldots, c_n) = (k_i c_1, \ldots, k_i c_n)$
(resp. $(c'_1, \ldots, c'_m) k'_j = (c'_1 k'_j, \ldots, c'_m k'_j)$). $k_i \in \mathbb{K}$ is called the output cost of $s_i$
(resp. $k'_j \in \mathbb{K}$ the input cost of $e_j$).



Definition 3. Let $G_{\mathbb{K}}$ be a Glushkov $\mathbb{K}$-graph. Let $O$ be a maximal orbit. Let
$s$ be a state of $Out(O)$. Let $(e_1, \ldots, e_m)$ be the ordered states of $In(O)$. Let
$(c_1, \ldots, c_m)$ be the set of coefficients from the state $s$. $G_{\mathbb{K}}$ is $\mathbb{K}$-stable if (1) $G_{\mathbb{K}}$
is strongly stable and (2) $\forall s_i \in Out(O)$, its set of successors' coefficients is
$k_i (c_1, \ldots, c_m) = (k_i c_1, \ldots, k_i c_m)$. $k_i$ is called the orbit coefficient of $s_i$.

We can now define the recursive version of WFA transversality and stability.

Definition 4. A $\mathbb{K}$-graph is strongly $\mathbb{K}$-transverse (resp. strongly $\mathbb{K}$-stable) if
it is $\mathbb{K}$-transverse (resp. $\mathbb{K}$-stable) and if, after deleting every edge of
$Out(O) \times In(O)$, it is strongly $\mathbb{K}$-transverse (resp. strongly $\mathbb{K}$-stable).

Proposition 3. A Glushkov $\mathbb{K}$-graph in SNF is strongly $\mathbb{K}$-stable and strongly
$\mathbb{K}$-transverse.

Proposition 4. Let $G = (X, U)$. $G$ is a Glushkov $\mathbb{K}$-graph if and only if

— $G$ is a hammock.

— $G$ is strongly $\mathbb{K}$-stable and strongly $\mathbb{K}$-transverse.

— The graph without orbit of $G$ is $\mathbb{K}$-reducible.

— For each maximal orbit $O$ the set of output coefficients is exactly the set
of orbit coefficients. After deleting the $Out(O) \times In(O)$ edges, this property is
preserved.

5 Conclusion 

We have extended here the characterization of Glushkov automata to multiplicities
in the SNF case. As far as non-SNF expressions are concerned, the difficulty
lies in the fact that when computing the automaton, some edges may be computed
several times (coefficients are added) and some may be deleted.

Abstract. This paper compares various methods for constructing minimal,
deterministic, acyclic, finite-state automata (recognizers) from sets
of words. Incremental, semi-incremental, and non-incremental methods
have been implemented and evaluated.



1 Motivation 

During the last 12 years, construction methods specialized for minimal,
acyclic, deterministic, finite-state automata have emerged. However, there
are various opinions about their performance, and about how they compare to more
general methods. Only partial comparisons are available. What has been compared
so far are complete programs, which performed not only the construction,
but also the computation of a certain representation (e.g. sparse matrix representation,
or various forms of compression).

The aim of this paper is to give answers to the following questions: 

— What is the fastest construction method? 

— What is the most memory-efficient method? 

— What is the fastest method for practical applications? 

— Do incremental methods introduce performance overhead? 

The third question is not the same as the first one, because one has to 
take into account the size of data and available main memory. Even the fastest 
algorithm may become painfully slow during swapping. 



2 Construction Methods 

Due to lack of space, the construction methods under investigation are only
enumerated here. The reader is referred to the bibliography for proper
descriptions. Some of the methods use a structure called the register of states. In
those algorithms, states in an automaton are divided into those that have been
minimized, i.e. they are unique in that part, and other states, i.e. those that
are to be minimized. The register is a hash table, and two states are considered
equivalent if they are either both final or both non-final, and they have the same
transitions (the same number, labels, and targets). Methods 1, 3, and 4 require
sorted data; the strings must be sorted lexicographically, on decreasing lengths,
and lexicographically on reversals of strings respectively.

Table 1. Characteristics of data used in experiments

                                                          automaton
                     strings   characters   av. len.    states     trans.
German  words        716 273   10 221 410      14.27    45 959     97 239
German  morph.     3 977 448  364 681 813      91.69   107 198    435 650
French  words        210 284    2 254 846      10.72    16 665     43 507
French  morph.       235 566   17 111 863      72.64    32 078     66 986
Polish  words         74 434      856 176      11.50     5 289     12 963
Polish  morph.        92 568    5 174 631      55.90    84 382    106 850
/usr/dict/words       45 407      409 093       9.01    23 109     47 346

The following methods have been investigated:

1. Incremental construction for sorted data ([2]).

2. Incremental construction for unsorted data ([2]).

3. Semi-incremental construction by Bruce Watson ([7]).

4. Semi-incremental construction by Dominique Revuz ([4]).

5. Building a trie and minimizing it using the Hopcroft algorithm ([3], [1]).

6. Building a trie and minimizing it using the minimization phase from the
incremental construction algorithms (postorder minimization).

7. Building a trie and minimizing it using the minimization phase from the
algorithm by Dominique Revuz (lexicographical sort [5]).
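
To make the role of the register concrete, here is a rough Python sketch in the spirit of method 1 (incremental construction for sorted data). It is an illustration written for this text, not the program used in the experiments: the class `State` and the functions `replace_or_register`, `build_sorted` and `accepts` are hypothetical names, and a Python dictionary stands in for the hash-table register.

```python
class State:
    def __init__(self):
        self.final = False
        self.edges = {}  # label -> State

    def signature(self):
        # Equivalence used by the register: same finality and the same
        # transitions (number, labels, targets). Targets are compared by
        # identity, which is valid because they are already registered.
        return (self.final,
                tuple(sorted((c, id(t)) for c, t in self.edges.items())))


def replace_or_register(state, register):
    label = max(state.edges)       # most recently added child (sorted input)
    child = state.edges[label]
    if child.edges:
        replace_or_register(child, register)
    key = child.signature()
    if key in register:
        state.edges[label] = register[key]  # reuse the equivalent state
    else:
        register[key] = child


def build_sorted(words):
    """Build a minimal acyclic DFA from lexicographically sorted words."""
    register, root = {}, State()
    for word in words:
        state, i = root, 0
        # Follow the longest prefix shared with the previous word.
        while i < len(word) and word[i] in state.edges:
            state, i = state.edges[word[i]], i + 1
        if state.edges:
            replace_or_register(state, register)
        for ch in word[i:]:        # append the remaining suffix
            state.edges[ch] = State()
            state = state.edges[ch]
        state.final = True
    if root.edges:
        replace_or_register(root, register)
    return root


def accepts(root, word):
    state = root
    for ch in word:
        if ch not in state.edges:
            return False
        state = state.edges[ch]
    return state.final
```

Only the path of the last word added is kept outside the register, which is why the automaton stays close to minimal throughout the construction.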



3 Experiments 

Data sets for evaluation were taken from the domain of Natural Language
Processing (NLP). Acyclic automata are widely used as dictionaries. Both word
lists and morphological dictionaries for German, French, and Polish, as well
as a word list for English, were used. Word lists and morphological dictionaries
have different characteristics. Strings in word lists are usually short, sharing
short suffixes. Strings in morphological dictionaries are much longer, with long
suffixes shared between entries. The data is summarized in Table 1.

All methods were implemented in a single program, with data structures
and most functions shared among different algorithms. Unfortunately, there is
not enough space in a short paper to describe the implementation in detail. In
the experiments, the hash function had 10001 possible values. The register was
implemented with an overflow area for each hash value being a set class from
the C++ Standard Template Library, a tree-like structure.

Fig. 1. Memory requirements for English words.

Fig. 2. Execution time for Polish words.

Fig. 3. Execution time for French morphology.

Figure 1 shows how the number of states grows during construction of an
automaton for a representative data set using various algorithms. The points
labeled “trie” represent non-incremental methods. Memory requirements for
construction algorithms are proportional to the number of states of the largest
automaton during construction. The diagram shows that only incremental methods
keep the automaton minimal throughout the process; other methods require
memory for additional redundant states before they arrive at the minimal
automaton. The intermediate automaton for non-incremental methods can be much
larger than the minimal one. Memory requirements for the incremental method for
unsorted data are identical to those for the sorted method on the same data, and
only slightly higher for data sorted for other methods. Watson’s algorithm
displays an unusual behavior. The largest number of states is reached approximately
half way through the construction process. This phenomenon is caused
by two factors: the sorting of the input data (from the longest strings to the shortest ones),
and the minimization scheme (prefixes cause minimization of larger words). Initial
memory requirements for Watson’s algorithm are higher than for a trie made
from data sorted lexicographically, as longer words come first. They become
lower towards the end of data, as shorter words trigger minimization.

To test the relation between the speed and the size of the data, each algorithm
was tested on 0.1, 0.2, ..., 0.9 of the data, and on the whole data. In the case of Revuz’s
algorithm and Watson’s algorithm, those parts were sorted accordingly, instead
of taking the same number of words from the beginning of the whole file sorted
according to the requirements. This is different from the measurements of memory
requirements, because they were all taken during a single run on the whole
appropriate file. Also, due to the multiuser, multitasking Unix environment, only processor
times were measured, not the elapsed “real” time. It means that the effects of
swapping do not show up on the diagrams. Only initial values for the trie + Hopcroft
minimization algorithm are shown, to underline differences between the other
algorithms. For most data, the trie + postorder minimization method was the
fastest. It was slightly faster than the algorithm for sorted data, and in some
cases their values are not distinguishable on the diagrams. For morphological
dictionaries, Revuz’s algorithm was faster (Fig. 3). This happens because very long
common suffixes were present in that data. The INTEX program uses Revuz’s
algorithm without the pseudo-minimization phase to save both time and disk space,
but annotations are kept short, and their expansions are kept elsewhere.

4 The Fastest Algorithm 

Surprisingly, the fastest construction algorithm is not yet described in the literature.
This is probably due to its simplicity. We define a deterministic finite-state
automaton as $M = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is the set of states, $\Sigma$ is the alphabet,
$\delta \in Q \times \Sigma \rightarrow Q \cup \{\bot\}$ is the transition function, $q_0$ is the initial state, and $F \subseteq Q$
is the set of final states. A somewhat formally awkward notation of assignment
to the delta function in the algorithm below means creating or modifying a
transition. This algorithm has exactly the same complexity as both incremental
algorithms from [2].

func trie_plus_postorder_minimization;
    start_state := construct_trie; Register := ∅;
    postorder_minimize(start_state);
    return start_state;
cnuf

func construct_trie();
    start := new state;
    while file not empty
        word := next word from file; i := 0; s := start;
        while i < length(word)
            if δ(s, word_i) = ⊥ → δ(s, word_i) := new state; fi
            s := δ(s, word_i); i := i + 1;
        elihw
        F := F ∪ {s};
    elihw
    return start;
cnuf

proc postorder_minimize(s);
    foreach a ∈ Σ : δ(s, a) ≠ ⊥
        postorder_minimize(δ(s, a));
        if ∃q ∈ Register : q ≡ δ(s, a) → δ(s, a) := q;
        else Register := Register ∪ {δ(s, a)}; fi
    hcaerof
corp
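
The pseudocode above can be transcribed into a short runnable Python sketch. The class `State`, the dictionary-based register keyed by a state signature, and the helper `accepts` are assumptions made for this illustration; the input is taken from a list rather than a file.

```python
class State:
    def __init__(self):
        self.final = False
        self.edges = {}  # label -> State


def construct_trie(words):
    start = State()
    for word in words:
        s = start
        for ch in word:
            if ch not in s.edges:      # delta(s, ch) is undefined
                s.edges[ch] = State()
            s = s.edges[ch]
        s.final = True                 # F := F + {s}
    return start


def postorder_minimize(s, register):
    for a, child in list(s.edges.items()):
        postorder_minimize(child, register)
        # Signature: finality plus (label, target) pairs; the targets are
        # already registered, so comparing them by identity is sufficient.
        key = (child.final,
               tuple(sorted((c, id(t)) for c, t in child.edges.items())))
        # Redirect to an equivalent registered state, or register the child.
        s.edges[a] = register.setdefault(key, child)


def trie_plus_postorder_minimization(words):
    start_state = construct_trie(words)
    postorder_minimize(start_state, {})
    return start_state


def accepts(start, word):
    s = start
    for ch in word:
        if ch not in s.edges:
            return False
        s = s.edges[ch]
    return s.final
```

Unlike the incremental methods, the whole trie exists in memory before minimization starts, which is the memory cost discussed in the experiments. The recursion depth equals the length of the longest word.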
5 Conclusions

— For unsorted data, the trie + postorder register-based minimization algo- 
rithm is the fastest, provided that we have enough memory to use it. The 
difference between the minimal automaton and the corresponding trie can 
be huge. 

— All but incremental methods produce intermediate automata much larger 
than the minimal ones. All alternatives to the incremental algorithm for 
unsorted data build a trie first - the worst possible case from the point of 
view of memory efficiency. Therefore, the incremental algorithm for unsorted 
data is the fastest algorithm for unsorted data in practical applications, and 
in fact the only algorithm for that purpose. 

— For typical sorted data, the trie + postorder register-based minimization
algorithm can be used, but as it builds a trie, it requires huge amounts 
of memory. The incremental algorithm for sorted data can be used instead 
with almost no performance penalty. For sorted data where strings share long 
suffixes, like in certain morphological data, Revuz’s algorithm is the fastest. 
However, such data can easily be transformed so that the long suffixes are 
stored separately (as it is done e.g. in INTEX). For the transformed data, 
Revuz’s algorithm is no longer the fastest one. It also requires non-standard 
sorting that cannot be performed efficiently using ready-made programs. 
Moreover, tools for constructing natural language morphologies have addi- 
tional data that can be used for faster construction algorithms. The author 
has implemented such an algorithm. 

— Both fully incremental algorithms are the most economical in their use
of memory. They are orders of magnitude better than other, even
semi-incremental methods. Even for a very small data set, like the one presented
in Figure 1, the intermediate automata in semi-incremental methods are 3-4
times larger than the minimal ones.

— Both incremental algorithms are the fastest in practical applications, i.e. for 
large data sets. The algorithm for sorted data is the fastest, but if data is 
not sorted, sorting it (and storing it in memory) may be more costly than 
using the algorithm for unsorted data. However, when the same data is used 
repeatedly, sorting is always beneficial. 

— It seems that incremental algorithms do not introduce much overhead when
compared to non-incremental methods. The differences between the incremental
sorted data algorithm and its non-incremental counterpart are minimal.
The non-incremental version of the semi-incremental Revuz’s algorithm
(trie + lexical sort) is sometimes faster than the original version for words,
and always slower for morphologies.

— Trie + Hopcroft minimization is the slowest algorithm. While all other
algorithms are linear, this one has an additional O(log n) factor, and it is
quite complicated compared to register-based algorithms.

Acknowledgements. The outline of experiments was discussed with Bruce
Watson. This research was carried out within the framework of the PIONIER
Project Algorithms for Linguistic Processing, funded by NWO (Dutch Organization
for Scientific Research) and the University of Groningen. The program
used in the experiments is available from
http://www.eti.pg.gda.pl/~jandac/adfa.html.



References 

1. A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer
Algorithms. Addison-Wesley Publishing Company, 1974.

2. Jan Daciuk, Stoyan Mihov, Bruce Watson, and Richard Watson. Incremental
construction of minimal acyclic finite state automata. Computational Linguistics,
26(1):3-16, April 2000.

3. John E. Hopcroft. An n log n algorithm for minimizing the states in a finite
automaton. In Z. Kohavi, editor, The Theory of Machines and Computations, pages
189-196. Academic Press, 1971.

4. Dominique Revuz. Dictionnaires et lexiques: méthodes et algorithmes. PhD thesis,
Institut Blaise Pascal, Paris, France, 1991. LITP 91.44.

5. Dominique Revuz. Minimisation of acyclic deterministic automata in linear time.
Theoretical Computer Science, 92(1):181-189, 1992.

6. Max Silberztein. INTEX tutorial notes. In Workshop on Implementing Automata
WIA99 - Pre-Proceedings, pages XIX-1 - XIX-31. 1999.

7. Bruce Watson. A fast new (semi-incremental) algorithm for the construction of
minimal acyclic DFAs. In Third Workshop on Implementing Automata, pages 91-98,
Rouen, France, September 1998. Lecture Notes in Computer Science, Springer.




Term Validation of Distributed Hard Real-Time 

Applications 



Gaelle Largeteau and Dominique Geniet 



Laboratoire d’Informatique Scientifique et Industrielle,
Université de Poitiers & ENSMA,
Téléport 2 - 1 avenue Clément Ader
BP 40109 86961 Futuroscope Chasseneuil cedex, France
{largeteau,dgeniet}@ensma.fr



Abstract. To validate a real-time system, one must validate
on the one hand its functional behaviour (by proving that it does what
it must do), and on the other hand its operational behaviour (by proving
that it respects its time specifications). Here, we deal with the operational
aspects. In previous works, we presented a technique, based on finite
automata, to validate real-time systems designed to run on a centralised
architecture. Here, we extend this approach to distributed systems. The
main contribution of this work is to show that, when the modeled physical
process is closed, finite automata and product operators are sufficient to
validate distributed systems operationally.



1 Introduction 

A real-time system is reactive and concurrent (all operations associated with
managing a process have to run simultaneously). It is a set of elementary tasks,
each of them coding a reaction to incoming events. This set is composed of
periodic and non-periodic tasks (related to alarm signals and user actions).
Validity of a real-time system is based on both the correctness of its results
and its conformity to timing constraints. There are two classes of real-time
systems: hard and soft. If failing to respect the terms has irretrievable
consequences, the system is hard; otherwise the system is soft. Here, we deal with
hard real-time systems composed of periodic tasks.

Validating a real-time system consists in proving that it will always be able
to react in conformity with its timing constraints, whatever the incoming event
flow. Term validation is then a decision process related to task scheduling
sequences. It usually follows one of two main approaches. The in-line approach
consists in choosing, at each context switch during the application run, the
task to elect. Since scheduling a task system with critical resources
is NP-complete [2], this approach is not optimal¹ for almost every task
configuration, and it has an exponential complexity. To solve this problem, the off-line
approach uses formal models to search for the existence of at least one scheduling
sequence satisfying the constraints (through model checking techniques).

¹ A scheduling algorithm is optimal if and only if it gives a valid scheduling sequence
whenever one exists.

The fast technological evolution of recent years (especially in network
communications) has resulted in distributed systems being used more and more
frequently as a base for hard real-time applications. These systems are founded on
real-time protocols integrating timing constraints for message transmissions.
Since scheduling distributed systems is NP-hard, the in-line approach is still
not optimal. We use the off-line approach defined in our previous work to suggest a method
that validates distributed systems operationally. Its principle is to integrate
communication protocols in the model and to adapt the model to targets
with different processor speeds. The physical architecture is composed of a set
of sites that communicate through a network. Each site has several processors
and a RAM shared between them. All the processors follow
the local clock of the site. Moreover, each site has a network board, which
contains a specialised processor. Real-time system validation requires
behaviours to be predictable. Since using a cache or a pipeline induces nondeterminism,
they are always disabled in real-time systems: here, we make the assumption, in
the framework of validation, that none are used.

Firstly, the model is presented in the framework of centralised real-time sys- 
tems with fixed execution time tasks. Then, we describe a modelling technique 
for distributed real-time systems integrating speed differences. We show how 
product automata can be used to model simultaneity in distributed systems. We 
assume that there is no task migration. The task placement is not considered. 

2 Centralised Systems Validation 

2.1 Task Temporal Modelling 

A real-time software is a set of atomic tasks. We denote such a system $(\tau_i)_{i \in [1,n]}$.
Each task is specified by: its arrival time $r_i$, its deadline $D_i$, its period $T_i$, and
$C_i$, the CPU execution time of each instance of the task. Parameters $r_i$, $D_i$ and
$T_i$ come from the specifications of the external system, but $C_i$ depends on both
the code of the task and the performance of the target processor. We assume
that all atomic statements have the same duration of one time unit². Hence,
$C_i$ is constant in time units. On the contrary, $r_i$, $D_i$ and $T_i$ do not depend on
the CPU frequency: they are not fixed in time units. In the following, we use
this property to build a model language which takes into account both term
specifications and the performance of the processor.

Let $\tau$ be a task with timing constraints $r$, $D$ and $T$. The code $\xi$ of $\tau$ is a word
over the set $P$ of atomic statements. The execution duration $C$ of $\tau$ is extracted
from this code. During its execution, $\tau$ can be active or suspended, depending on
whether it owns the processor or not. Then, for each time unit, observing the state
of $\tau$ allows us to build an activity/inactivity sequence for $\tau$. The set of sequences
that respect the specified timing constraints is the temporal model of $\tau$. Our aim
is to build this set, i.e. to build the regular language $L(\tau)$. We consider time as
implicit: each task processes one action per time unit. Letter $a$ models the activity
state of $\tau$ for one time unit, and $\bullet$ models its suspended state for one time unit.
We note $\Sigma = \{a, \bullet\}$. The temporal model associated with $\xi$ is the word $\phi(\xi)$, where $\phi$ is
the concatenation morphism $P \rightarrow \{a\}$. The length of $\phi(\xi)$ is the duration $C$ of $\tau$.
The temporal model $L_u(\tau)$ of $\tau$ is obtained by inserting the system inactivity periods
(using $\bullet$). We use the Shuffle³ operator $\sqcup\!\sqcup$; the generic expression of the model
$L(\tau)$ is given in [3]. Each word $w$ of this
language has the same length $T$ and is called a valid temporal behaviour of
$\tau$. The automaton associated with $L(\tau)$ has a generic pattern that depends on
the temporal features of the application (see Figure 1).

² Time unit is the duration of a non-preemptive instruction (assembler).

Fig. 1. Task Model

Task $\tau$ runs on a processor with particular temporal features: we define
a time unit as the time interval between two clock ticks, and a cadence as the
inverse of its duration. In the multi-processor case, all processors of the same site
work synchronously. The duration of $\tau$ is then equal to $|\phi(\xi)| \times u_S$ on
site $S$, and $|\phi(\xi)| \times u_T$ on site $T$. The temporal features of $\tau$ ($D$, $T$, etc.) are
no longer expressed in the language directly as a number of occurrences of $\bullet$, but
as the number of occurrences of $\bullet$ that is necessary to model the inactivity time
on the target processor. We note $L_u(\tau)$ the set of valid temporal
behaviours of $\tau$ on a processor that has $u$ as time unit.

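Example 1. Let $\tau$ be a task with $(r, D, T, C) = (3\,ms, 8\,ms, 10\,ms, 3\,t.u.⁵)$. A model of $\tau$ can be
$L_{1ms}(\tau) = Center⁴(\bullet^{3}((\bullet^{5} \sqcup\!\sqcup\, a^{3})\bullet^{2})^{*})$, or $L_{250\mu s}(\tau) = Center(\bullet^{12}((\bullet^{29} \sqcup\!\sqcup\, a^{3})\bullet^{8})^{*})$.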

The ratio between $a$'s and $\bullet$'s depends on the target cadence. In the following, we
show that a task $\tau$, defined with temporal characteristics $(r, D, T, C)$ and designed
to run on a processor with a cadence $c$ ($u = 1/c$), can always be associated with
a regular language $L_u(\tau)$. In order to integrate processor features in the model,
values expressed in seconds ($r$, $D$, $T$) have to be converted into time units. Let $u$
be the time unit associated with the processor: $x$ seconds correspond to $\frac{x}{u}$ t.u.
on this processor. The values in seconds of $r$, $D$ and $T$ are respectively equal to
$\frac{r}{u}$, $\frac{D}{u}$ and $\frac{T}{u}$ t.u. Usually, the timing constraints $r$, $D$ and $T$ are several orders of
magnitude larger than the time unit $u$. Since $\mathbb{Q}$ is dense in $\mathbb{R}$, we can assume
that $\frac{r}{u}$, $\frac{D}{u}$ and $\frac{T}{u}$ are integers. By using the same approach as in [2], we get
$L_u(\tau) = Center(\bullet^{r/u}((\bullet^{D/u - C} \sqcup\!\sqcup\, a^{C})\bullet^{(T-D)/u})^{*})$ as task model. We note $\Gamma$ the set
$\{L_u(\tau), u \in \mathbb{Q}^{*+}\}$.

³ Shuffle ($\sqcup\!\sqcup$) is defined by the formula: $\forall a \in \Sigma$, $a \sqcup\!\sqcup\, \varepsilon = a$, and $\forall (a, b, w, w') \in \Sigma^2 \times (\Sigma^*)^2$,
$aw \sqcup\!\sqcup\, bw' = a(w \sqcup\!\sqcup\, bw') \cup b(aw \sqcup\!\sqcup\, w')$.

⁴ The center of $L$ is the set of prefixes of $L$ that are indefinitely extendable in $L$; algebraically,
$Center(L^*) = L^*.LeftFactors(L)$.

⁵ t.u. stands for time unit.
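
The conversion to time units and the pattern of Example 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `to_time_units` and `period_behaviours` are hypothetical, and the sketch enumerates the words of a single period, $(\bullet^{D-C} \sqcup\!\sqcup\, a^{C})\bullet^{T-D}$, rather than the full Center language.

```python
from fractions import Fraction
from itertools import combinations


def to_time_units(seconds, u):
    """Convert a duration in seconds to a whole number of time units u."""
    q = Fraction(seconds) / Fraction(u)
    if q.denominator != 1:
        raise ValueError("duration is not a multiple of the time unit")
    return int(q)


def period_behaviours(D, T, C):
    """Words of one period; D, T and C are given in time units."""
    words = []
    for active in combinations(range(D), C):  # slots where the task runs
        window = ["•"] * D
        for slot in active:
            window[slot] = "a"
        # C active slots shuffled among D, then T - D idle slots.
        words.append("".join(window) + "•" * (T - D))
    return words
```

With a time unit of 250 µs, `to_time_units(Fraction(3, 1000), Fraction(1, 4000))` yields the 12 leading idle slots of Example 1, and `period_behaviours(8, 10, 3)` lists the one-period words for the 1 ms time unit.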

2.2 Validation 

In earlier work, we defined a technique, based on the Arnold-Nivat [1] model, to
collect all valid scheduling sequences of a task system. The principle consists
first in associating each critical resource $R_j$ (processor, resource, message, etc.)
with a virtual task $V_{R_j}$ (modeled by a regular language $L(V_{R_j})$), and then
associating the system with the homogeneous product $\Omega$ of the $L(\tau_i)$ and
the $L(V_{R_j})$. Let us call $S$ the subset of $\prod_{i \in I} (\Sigma_i)$ of vectors describing valid
configurations (respecting mutual exclusion on processors or resources). We proved in
earlier work that the language⁶ $Proj_I(Center(\Omega_{i \in I} L(\tau_i) \cap S^*))$ collects the set of valid scheduling
sequences, from the point of view of resource management and timing constraints.

Validating a real-time task system $(\tau_i)_{i \in I}$ consists in deciding whether the
configuration $(\tau_i)_{i \in I}$ can be scheduled in conformity with its time constraints. This
decision is reached by evaluating the predicate $(Center(\Omega_S(L(\tau_i))) = \emptyset)$ using
an automaton associated with the language. If the language is not empty, there exists
a valid temporal behaviour and the configuration can be scheduled; otherwise
there is no way to schedule the system.

3 Model for Distributed Systems 

A distributed system is defined by a lack of common memory, the use of a communication
system, and by the fact that there is no global state that can be observed
[5]. Such a system is characterised by a set of sites, running at different speeds.
Each site has a local clock that does not depend on others and that is the refer- 
ence for every processor of the site. A clock is defined as an increasing sequence 
depending on time. A model that collects behaviors of a distributed system must 
be able to express simultaneity of different tasks placed on different sites, with no 
assumption concerning both a global time and correlations between the different 
speeds of the sites. 

The model presented in section 2 is useful to validate real-time systems placed
on a single site, possibly multi-processor. The speed of the site is implicitly
modeled by the time unit associated with the labels of the edges of the
product automaton which models the software. A distributed system can be
viewed (from a modeling point of view) as a set of such automata, each automaton
being associated with its own time unit. As long as the target architecture is
known and static, we know a priori the different speeds of the different sites.
To build an



$\mathrm{Proj}_I(\mathrm{Center}(L(\tau)_{i \in I} \cap S^*))$ is noted $\Omega^{S}_{i \in I}(L(\tau_i))$ in the following.
automaton that collects all behaviors of the system, we need two tools. First, a
zoom technique, to align the different automata on the same time semantics:
we cannot give a semantics to a product automaton $A \,\Omega\, B$ when A and B
do not share the same time semantics. Second, a start result, to show that the
respective starting times of the different sites have no influence on the time
validity of the software. The zoom technique is presented in section 3.1. The
start result is obtained as an obvious corollary of properties of the words of
the center of a regular language.



3.1 Zoom Languages 

In the $\Gamma$ building process (recall that $\Gamma = \{L_u(\tau),\ u \in
\mathbb{Q}^{+*}\}$), we take various CPU speeds into account. We obtain a
language class which satisfies the following property: for each word of each
language of $\Gamma$, the ratio between $a$'s and $\bullet$'s is a function of
the cadence. Consider $L_u \in \Gamma$. We call granularity the duration
associated with each letter of the words of $L_u$. It is the time semantics
of each edge of the automaton. We note ${}_gL_u$ the observation of $L_u$ with
the granularity $g$ (i.e. the zoom rate $u/g$).

Example 2. Let $L = \{a\}$ be a language associated with the time unit 1 ms:
the duration of $a$ is 1 ms. Consider the language $M = \{a^n\}$ associated with
the duration $\frac{1}{n}$ ms. $L$ and $M$ share the same semantics, because
they both contain the same behaviour, whose semantics is "$\tau$ is in the $a$
state during 1 ms": $|a| \times 1 = |a^n| \times \frac{1}{n}$. On the contrary,
we observe that $N = \{a^n\}$ associated with a duration of $\frac{1}{m}$ ms,
$m \neq n$, is not equivalent to $L$: $|a| \times 1 \neq |a^n| \times
\frac{1}{m}$. $M$ is the observation of $L$ with the granularity $\frac{1}{n}$.
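
Example 2 can be checked numerically: two observations carry the same time
semantics when the number of letters multiplied by the letter duration is
unchanged. The sketch below is only an illustration; the value `n = 4` and the
mismatched duration `1/(n+1)` chosen for `N` are assumptions, not values taken
from the text.

```python
from fractions import Fraction


def duration(word, letter_duration):
    """Total time (in ms) covered by `word` when each letter lasts `letter_duration`."""
    return len(word) * letter_duration


n = 4  # arbitrary illustrative value (assumption)
L_word, L_unit = "a", Fraction(1)              # L = {a}, 1 ms per letter
M_word, M_unit = "a" * n, Fraction(1, n)       # M = {a^n}, 1/n ms per letter
N_word, N_unit = "a" * n, Fraction(1, n + 1)   # same word, mismatched duration

same_semantics = duration(L_word, L_unit) == duration(M_word, M_unit)
mismatch = duration(L_word, L_unit) != duration(N_word, N_unit)
```

Exact rational arithmetic is used so that the comparison is not affected by
floating-point rounding.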



To build the product $L_{u_1} \,\Omega\, L_{u_2}$, $L_{u_1}$ and $L_{u_2}$ must
be observed with the same granularity. Then, we must be able to get $g \in
\mathbb{Q}$ such that both ${}_gL_{u_1}$ and ${}_gL_{u_2}$ exist. Then, we must
build the set $\mathcal{L}_u(\tau)$ of languages that collects the behaviors of
$L_u$ in different granularities, i.e. the set of ${}_gL_u(\tau)$ associated
with task $\tau$ running on a site that has $u$ as a time unit and observed
with a granularity $g$. To build $\mathcal{L}_u(\tau)$, we use the isomorphism
$\psi$, defined by:

$\psi_{u,g} : P_C \cup \{a, \bullet\} \to (P_C \cup \{a, \bullet\})^k$ such that $k = u/g$,

$\forall x \notin P_C,\ \psi(x) = x^k$ ;

$\forall x \in \{P, S\},\ \psi(x) = a^{k-1}.x$ ;

$\forall x \in \{V, R\},\ \psi(x) = x.a^{k-1}$.

Then $\mathcal{L}_u(\tau) = \{ {}_gL_u(\tau),\ {}_gL_u(\tau) \subset \Sigma^* \ /\
\exists g \in \mathbb{Q}^{+*},\ u \in g\mathbb{N},\ {}_gL_u(\tau) =
\psi_{u,g}(L_u(\tau)) \}$. Consider ${}_{g_1}L_{u_1}$ and ${}_{g_2}L_{u_2}$. We
remark that ${}_{g_1}L_{u_1} \,\Omega\, {}_{g_2}L_{u_2}$ has a time semantics if
and only if $g_1 = g_2$. The model expresses simultaneity through the
homogeneous product $\Omega$ of languages; our goal is then to find a
granularity $g$ which, applied to all sites, gives a temporal semantics to the
product. To reach this aim, we extend in a natural way the GCD notion to
$\mathbb{Q}$ (the GCD operator is noted $\wedge$).
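
The two ingredients of this section, a GCD on positive rationals and the zoom
of a word from time unit u to granularity g, can be sketched as follows. This
is an illustration under stated assumptions, not the authors' code: the
function names are hypothetical, words are plain strings, and the rule that
letters outside {P, S, V, R} are repeated k times is an assumption drawn from
the requirement that every letter be mapped to a word of length k.

```python
from fractions import Fraction
from math import gcd


def rational_gcd(u, v):
    """Largest positive rational g such that u/g and v/g are both integers."""
    u, v = Fraction(u), Fraction(v)
    return Fraction(
        gcd(u.numerator * v.denominator, v.numerator * u.denominator),
        u.denominator * v.denominator,
    )


def zoom(word, u, g, pad="a"):
    """Observe `word` (time unit u) with granularity g, where k = u/g is an integer.

    P and S are preceded by k-1 padding letters, V and R are followed by
    k-1 padding letters, and any other letter is repeated k times
    (the last rule is an assumption of this sketch).
    """
    k = Fraction(u) / Fraction(g)
    if k.denominator != 1 or k < 1:
        raise ValueError("u must be a positive integer multiple of g")
    k = int(k)
    out = []
    for x in word:
        if x in ("P", "S"):
            out.append(pad * (k - 1) + x)
        elif x in ("V", "R"):
            out.append(x + pad * (k - 1))
        else:
            out.append(x * k)
    return "".join(out)
```

For two sites with time units 1/2 and 1/3, `rational_gcd` yields the common
granularity 1/6, and `zoom` can then rewrite the words of each site on that
granularity before the product is taken.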
Fig. 2. Distributed application example (Site 1, Site 2)



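Theorem 1. Given $\mathcal{L}_{u_1}(\tau_1)$ and $\mathcal{L}_{u_2}(\tau_2)$. Then, $\exists g \in \mathbb{Q}^{+*}$, $g = u_1 \wedge u_2$ and

${}_gL_{u_1}(\tau_1) \subset \Sigma^*$, ${}_gL_{u_2}(\tau_2) \subset \Sigma^*$ such that ${}_gL_{u_1}(\tau_1) \in \mathcal{L}_{u_1}(\tau_1)$ and ${}_gL_{u_2}(\tau_2) \in \mathcal{L}_{u_2}(\tau_2)$.

The obtained languages are maximal: $\nexists g' > g \ /\ {}_{g'}L_{u_1}(\tau_1) \in \mathcal{L}_{u_1}(\tau_1),\ {}_{g'}L_{u_2}(\tau_2) \in \mathcal{L}_{u_2}(\tau_2)$.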

Moreover, $\psi$ gives a construction algorithm for this language class. This
theorem gives a technique to build a set of languages sharing the same
granularity. This set makes it possible to use the homogeneous product for the
composition of systems placed on different sites. This approach holds whatever
the site speeds and start times, and it can be used in the frame of
multi-processor centralised systems that do not have a global clock.

3.2 Communication Integration 

In the previous part, we have established our model validity in the distributed 
case. We apply the homogeneous product to languages corresponding to each 
site. We note $L_{u_n}(S_n)$ the language associated with the site $S_n$.

Let $(L_{u_i}(S_i))_{i \in J}$ be the set of languages associated with the
sites. We use Theorem 1 (section 3.1): let $G = \wedge_{i \in J}(u_i)$, and
$({}_GL(S_i))_{i \in J}$ the set of languages such that: $\forall i \in J$,
${}_GL(S_i) = \psi_{u_i,G}(L_{u_i}(S_i))$. The languages $({}_GL(S_i))_{i \in J}$
are all built on the same granularity $G$; we can therefore build
${}_GL = \Omega(({}_GL(S_i))_{i \in J})$. ${}_GL$ gives the model of the system
$(\tau_i)_{i \in I}$ on the sites $(S_j)_{j \in J}$.

To temporally validate this system, we have to integrate communication
protocols into the model. Our aim is to guarantee that message transmissions
respect their timing constraints.

Then, it is necessary to have a model for network behaviors (see figure 3). To
obtain the model for all the drivers, we first model one of them, and then their
simultaneous run using the Arnold-Nivat product. The driver task is duplicated
in order to run on each site of the system, on a dedicated processor (the network
board CPU). The model language that collects driver behaviors is called $D$;
it is built on the network. The language $D$ has the granularity $g$ of one bit
transmission duration.

For all $i$ in $J$, ${}_gD_i$ is the language associated with driver $i$. All
drivers share the same code and hence the same language. Let Prot be the
synchronisation set expressing the communication constraints of the protocol.
Then ${}_gR = \Omega_{Prot}(({}_gD_i)_{i \in J})$ is the model for the network
(see figure 3). This method can be used for any protocol that supports an
automaton-based model.

We then check the compatibility of message transmission and application
timing constraints. To guarantee the application deadlines, a message must be
transmitted in a limited time: it has a deadline. We use a virtual task Stw
(stopwatch) that keeps a record of the elapsed time between a message Send and
its Receive.

Fig. 3. Driver and Network automata

This deadline, associated with a network model, makes it possible to decide the
compatibility between transmission and timing constraints. The granularity of
Stw is $G$. Let ${}_GMsg = \mathrm{Proj}_{Stw}({}_GL \,\Omega_{Sr}\, {}_GStw)$,
computed with a resource synchronisation ($Sr$) on the stopwatch (section 2.2).
To test transmission validity, we must compute the homogeneous product of the
languages ${}_GMsg$ and ${}_gR$. We apply Theorem 1: using $H = G \wedge g$, we
get ${}_HMsg$ and ${}_HR$.

Language ${}_HL = \mathrm{Center}({}_HMsg \,\Omega_{Sr}\, {}_HR)$ collects the
set of valid message schedulings (respecting timing constraints) on the network.
The validity test is the same as in centralised system validation: if
${}_HL = \emptyset$ then there is no valid behavior; otherwise, there exists at
least one.

4 Conclusion 

Languages $\mathcal{L}_u(\tau)$ are useful to validate hard real-time
distributed systems, if they are based on protocols that can be modeled by
regular languages. The centralised model was extended by considering processor
speeds and by defining a zoom operation on languages. This last tool, associated
with a generalisation of the GCD to $\mathbb{Q}$, is useful to model distributed
systems with finite automata without adding restrictive hypotheses (the only one
is the closure of the modeled system: this is not a restriction when considering
real-time systems).

The result is a schedulability decision for the application on a distributed
architecture. One of the central corollaries of this approach is the cyclicity of
scheduling sequences in a distributed multi-processor environment: this result
is an immediate corollary (star lemma) of the fact that the set of valid
schedulings is a regular language.

This work is ongoing. Our present studies concern both the integration of task 
migration and the integration of a small level of non determinism by considering 
alarm events. 
Abstract. Given a set of strings, a common subsequence of this set is
a string that is a subsequence of each string in this set. We describe an 
on-line algorithm building the finite automaton that accepts all common 
subsequences of the given set of strings. 



1 Introduction 

A subsequence of a string T is any string obtainable by deleting zero or more 
symbols from T. Given a set P of strings, a common subsequence of P is a string 
that is a subsequence of every string in P. Problems on subsequences arise in 
many areas, e.g. in molecular biology and coding theory. One of the problems 
with great practical impact is the longest common subsequence (LCS) problem. 
The problem is to find, given a set P of strings, a common subsequence of P 
that has maximal length among all common subsequences of P. If the number 
of strings in P is not bounded, the problem is NP-complete, as was shown by 
Maier [9]. 

The algorithms for the LCS problem are usually divided into two groups 
in literature: the algorithms for the LCS of two strings, and the algorithms 
for the LCS of three or more strings. This separation is sensible, because many 
algorithms from the former group do not have any straightforward generalization 
for three or more strings. We briefly mention several algorithms from both
groups.

The first solution of the LCS problem of two strings was probably dynamic
programming, which was discovered independently by several scientists.
Improvements were described by Hirschberg [4] and Hunt and Szymanski [7]. An
algorithm with the best known worst-case time complexity was given by Masek
and Paterson [10]. Itoga [8] extended the dynamic programming to the case
of an arbitrary number of strings. Hsu and Du [6] introduced a common
subsequence tree. Crochemore and Troníček [2] described an automaton that accepts
all common subsequences of the given strings and gave an off-line algorithm for
building it.

Another problem on subsequences is to decide, for a string S and a set P of
strings, whether S is a subsequence of P. Preprocessing the strings in P makes
it possible to solve the problem in time linear in the length of S. Baeza-Yates
[1] described the

* This research has been supported by GACR grant No. 201/01/1433. 
automaton accepting all subsequences of a given set of strings and a
right-to-left algorithm for building it. This automaton is called the Directed
Acyclic Subsequence Graph (DASG). Crochemore and Troníček gave a left-to-right
algorithm. An on-line algorithm was given by Hoshino et al. [5].

In this paper, we describe an on-line algorithm building the CSA. The language
accepted by the CSA is a subset of the language accepted by the DASG
for the same strings. This implies similarity between the CSA and the DASG. The
on-line algorithm building the CSA described below is a modification of the
algorithm from [5]. A possible application of the CSA is the LCS problem or the
problem of separating two sets of strings. Given two sets P and N of strings, we
say that a string S separates P and N if S is a common subsequence of P and
simultaneously S is not a subsequence of any string in N. The problem has
application in discovery science and machine learning [3]. We slightly
reformulate the problem: given two sets P and N of strings and a string S, we
ask if S separates P and N. We assume that the problem should be answered for
several different strings S. Then it makes sense to preprocess the strings in P
and N. With the CSA for P and the DASG for N, we are able to answer the question
in time linear in the length of S.
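
The separation question can be stated directly in code. The sketch below is a
naive reference check that scans every string for each query; it does not use
the CSA or the DASG and therefore does not reach the linear bound in the length
of S. The function names are illustrative.

```python
def is_subsequence(s, t):
    """Return True iff s can be obtained from t by deleting symbols."""
    remaining = iter(t)
    # `ch in remaining` advances the iterator past the first match of ch.
    return all(ch in remaining for ch in s)


def separates(s, positives, negatives):
    """s separates P and N: common subsequence of P, subsequence of no string in N."""
    return (all(is_subsequence(s, t) for t in positives)
            and not any(is_subsequence(s, t) for t in negatives))
```

Such a direct check is convenient as a test oracle for an automaton-based
implementation.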

The paper is organized as follows. In section 2 we recall the definition of the
CSA from [2], in section 3 we prove two properties of the CSA, and in section 4
we describe an on-line algorithm building the CSA.

Let $\Sigma$ be a finite alphabet of size $\sigma$ and $\varepsilon$ the empty
word. A finite automaton is, in this paper, a 5-tuple $(Q, \Sigma, \delta, q_0,
F)$, where Q is a finite set of states, $\Sigma$ is an input alphabet,
$\delta : Q \times \Sigma \to Q$ is a transition function, $q_0$ is the initial
state, and $F \subseteq Q$ is the set of final states. Notation $(i, j)$ means
the interval of integers from i to j, including both i and j. All strings in
this paper are considered on alphabet $\Sigma$.

2 Definition of CSA 

Let P denote a set of strings $T_1, T_2, \dots, T_k$. Let $n_i$ be the length of
$T_i$ and $T_i[j]$ be the j-th symbol of $T_i$ for all $j \in (1, n_i)$ and all
$i \in (1, k)$. Given $T = t_1 t_2 \dots t_n$ and $i, j \in (1, n)$, $i \le j$,
notation $T[i \dots j]$ means the string $t_i t_{i+1} \dots t_j$.

Definition 1. We define a position point of the set P as an ordered k-tuple
$[p_1, p_2, \dots, p_k]$, where $p_i \in (0, n_i)$ is a position in string
$T_i$. If $p_i \in (0, n_i - 1)$ then it denotes the position in front of the
$(p_i + 1)$-th symbol of $T_i$, and if $p_i = n_i$ then it denotes the position
behind the last symbol of $T_i$, for all $i \in (1, k)$.

A position point $[p_1, p_2, \dots, p_k]$ is called the initial position point
if $p_i = 0$ for all $i \in (1, k)$. We denote by ipp the initial position point
and by $Pos(P)$ the set of all position points of P.

Definition 2. For a position point $[p_1, p_2, \dots, p_k] \in Pos(P)$ we define
the common subsequence position alphabet as the set of all symbols which are
contained simultaneously in $T_1[p_1+1 \dots n_1], \dots, T_k[p_k+1 \dots n_k]$,
i.e. $\Sigma_{cp}([p_1, p_2, \dots, p_k]) = \{a \in \Sigma : \forall i \in (1, k)\ \exists j \in (p_i+1, n_i) : T_i[j] = a\}$.
Definition 3. For $a \in \Sigma$ and a position point $[p_1, p_2, \dots, p_k] \in Pos(P)$ we
define the common subsequence transition function:

$csf([p_1, p_2, \dots, p_k], a) = [r_1, r_2, \dots, r_k]$, where $r_i = \min\{j : j > p_i \text{ and } T_i[j] = a\}$
for all $i \in (1, k)$ if $a \in \Sigma_{cp}([p_1, p_2, \dots, p_k])$, and

$csf([p_1, p_2, \dots, p_k], a) = \emptyset$ otherwise.

Let $csf^*$ be the reflexive-transitive closure of $csf$.



Lemma 1. The automaton $(Pos(P), \Sigma, csf, ipp, Pos(P))$ accepts a string S
iff S is a common subsequence of P.

Proof. See [2]. □

The automaton from Lemma 1 is called the Common Subsequence Automaton
(CSA) for strings $T_1, T_2, \dots, T_k$.
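
The transition function of Definition 3 and the acceptance condition of Lemma 1
translate almost literally into code. In this minimal sketch, position points
are tuples, `None` stands for the empty result, and the names `csf` and
`accepts` as Python functions are illustrative choices.

```python
def csf(strings, point, a):
    """Common subsequence transition from `point` on symbol `a`.

    Positions are 1-based as in the text, so the 0-based search for `a`
    starts at index p, which is exactly the first symbol after position p.
    Returns None when `a` is not in the position alphabet of `point`.
    """
    target = []
    for t, p in zip(strings, point):
        j = t.find(a, p)
        if j < 0:
            return None
        target.append(j + 1)
    return tuple(target)


def accepts(strings, s):
    """Run the CSA on s; every position point is a final state."""
    point = (0,) * len(strings)  # initial position point
    for a in s:
        point = csf(strings, point, a)
        if point is None:
            return False
    return True
```

For instance, with the strings abab and babbab, reading a from the initial
position point leads to [1, 2], and reading b leads to [2, 1].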

We briefly describe an off-line algorithm building the CSA. The algorithm
generates step by step all reachable position points (states). At each step we
process one position point. First, we find the common subsequence position
alphabet for this point and then determine the common subsequence transition
function for each symbol of that alphabet. When the position point has been
processed, we continue with the next point until the transitions of all
reachable position points are determined. The complexity of the algorithm
depends on the number of states of the resulting automaton. If the total number
of states is $O(t)$ then the algorithm requires $O(k\sigma t)$ time.
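
The off-line construction can be sketched as a breadth-first exploration of the
reachable position points. This is an illustrative version, not the authors'
implementation: it recomputes each transition by a string search instead of
using precomputed next-occurrence tables, so it does not meet the stated time
bound. The name `build_csa` is hypothetical.

```python
from collections import deque


def build_csa(strings):
    """Return (initial point, transitions) of the CSA of a non-empty list of strings.

    `transitions` maps each reachable position point to {symbol: target point}.
    """
    strings = list(strings)
    # Only symbols occurring in every string can label a transition.
    alphabet = sorted(set(strings[0]).intersection(*map(set, strings[1:])))

    def step(point, a):
        target = []
        for t, p in zip(strings, point):
            j = t.find(a, p)  # first occurrence strictly after position p
            if j < 0:
                return None
            target.append(j + 1)
        return tuple(target)

    initial = (0,) * len(strings)
    transitions = {initial: {}}
    queue = deque([initial])
    while queue:
        point = queue.popleft()
        for a in alphabet:
            target = step(point, a)
            if target is None:
                continue
            transitions[point][a] = target
            if target not in transitions:
                transitions[target] = {}
                queue.append(target)
    return initial, transitions
```

Each reachable position point is processed once, and for each point every
symbol of the alphabet is tried against all k strings, which mirrors the
structure behind the bound given above.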

An $\ell$-dominant match (also called minimal) of two strings is an ordered pair
$[i_1, i_2]$ such that $T_1[i_1] = T_2[i_2]$, the length of the LCS of
$T_1[1 \dots i_1]$, $T_2[1 \dots i_2]$ is equal to $\ell$, and the length of the
LCS of the pairs $T_1[1 \dots i_1 - 1]$, $T_2[1 \dots i_2]$ and
$T_1[1 \dots i_1]$, $T_2[1 \dots i_2 - 1]$ is less than $\ell$. We recall the
generalization of dominant matches for more than two strings: an
$\ell$-dominant match of $T_1, \dots, T_k$ is an ordered k-tuple
$[i_1, \dots, i_k]$ such that $T_1[i_1] = \dots = T_k[i_k]$, the length of the
LCS of $T_1[1 \dots i_1], \dots, T_k[1 \dots i_k]$ is equal to $\ell$, and for
an arbitrary permutation $[j_1, \dots, j_k]$ of $[1, 0, 0, \dots, 0]$ the length
of the LCS of $T_1[1 \dots i_1 - j_1], \dots, T_k[1 \dots i_k - j_k]$ is less
than $\ell$.

3 Properties of CSA 

First, we show the correspondence between the dominant matches and states of 
the CSA. 

Lemma 2. Given a set P of strings, a position point p G Pos{P) is reachable 
iff p is a dominant match of strings in P. 

Proof. We prove two implications: 

1. If p is a dominant match of strings in P then p is reachable. Let
$p = [p_1, p_2, \dots, p_k]$ be an $\ell$-dominant match and let
$a_1 a_2 \dots a_\ell$ be the longest common subsequence of strings
$T_1[1 \dots p_1], T_2[1 \dots p_2], \dots, T_k[1 \dots p_k]$. If the longest
common subsequence is not determined uniquely, we can choose an arbitrary
one. Obviously, $a_1 a_2 \dots a_\ell$ determines the $\ell$-dominant match. For
two distinct $\ell$-dominant matches we obtain distinct longest common
subsequences. Hence, $csf^*(ipp, a_1 a_2 \dots a_\ell) = [p_1, p_2, \dots, p_k]$
and so each dominant match is also a reachable position point.

2. If p is a reachable position point then p is a dominant match of strings in P.
Since p is reachable, there exists a path $a_1 a_2 \dots a_\ell$ such that
$csf^*(ipp, a_1 a_2 \dots a_\ell) = p$. This path is a common subsequence of
strings in P. We find the shortest prefix of each string in P such that this
prefix contains the common subsequence. A dominant match is determined by these
prefixes. □

We will show that the number of dominant matches of two strings can be 
quadratic in the length of the input strings. The proof is based on the same idea 
as the proof of quadratic growth of the number of states for the DASG in . 

Lemma 3. Let $T_1 = (ab)^{2x}$, $T_2 = (bab)^x$, where $x \in \mathbb{Z}$,
$x \ge 1$, and let R denote the set of all reachable position points of
$T_1, T_2$. Then, R contains the position points $[2(i + j) - 3, 3i - 1]$ for
all $i \in (1, x)$ and all $j \in (1, i + 1)$.

Proof (by induction):

1. $i = 1$: Clearly, $[1, 2], [3, 2] \in R$.

2. We write the position points for $i$ (from the lemma): $[2i - 1, 3i - 1],
[2i + 1, 3i - 1], \dots, [4i - 1, 3i - 1]$. Further, we find out the transitions
for each of these position points and generate new points:

$[2i - 1, 3i - 1] \xrightarrow{a} [2i + 1, 3i + 2]$

$[2i + 1, 3i - 1] \xrightarrow{a} [2i + 3, 3i + 2]$

$\dots$

$[4i - 1, 3i - 1] \xrightarrow{a} [4i + 1, 3i + 2]$

$[4i - 1, 3i - 1] \xrightarrow{b} [4i, 3i] \xrightarrow{b} [4i + 2, 3i + 1] \xrightarrow{a} [4i + 3, 3i + 2]$

The newly generated points $[2i + 1, 3i + 2], [2i + 3, 3i + 2], \dots,
[4i + 1, 3i + 2], [4i + 3, 3i + 2]$ are exactly the points from the lemma for
$i + 1$. □
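
Lemma 3 can be checked experimentally for small x by enumerating the reachable
position points. The sketch below assumes that the first string is
$(ab)^{2x}$, which is the shortest such string long enough to contain the
largest first coordinate $4x - 1$ required by the lemma; the function names
are illustrative.

```python
from collections import deque


def reachable_points(strings):
    """Set of position points reachable from the initial position point."""
    alphabet = set(strings[0]).intersection(*map(set, strings[1:]))
    initial = (0,) * len(strings)
    seen = {initial}
    queue = deque([initial])
    while queue:
        point = queue.popleft()
        for a in alphabet:
            hits = [t.find(a, p) for t, p in zip(strings, point)]
            if min(hits) < 0:
                continue  # `a` is missing from some remaining suffix
            target = tuple(j + 1 for j in hits)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def lemma3_points(x):
    """Points [2(i+j)-3, 3i-1] for i in 1..x and j in 1..i+1."""
    return {(2 * (i + j) - 3, 3 * i - 1)
            for i in range(1, x + 1) for j in range(1, i + 2)}


def lemma3_holds(x):
    t1, t2 = "ab" * (2 * x), "bab" * x
    return lemma3_points(x) <= reachable_points([t1, t2])
```

The lemma lists x(x + 3)/2 distinct points for strings of lengths 4x and 3x,
which is the source of the quadratic lower bound on the number of states.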

4 Building CSA 

We will describe (informally) an on-line algorithm building the CSA for a set of
strings $T_1, T_2, \dots, T_k$. In the first step, we build the DASG for $T_1$
(which is also the CSA for $T_1$), and then in each subsequent step we load the
next string. That is, after loading the j-th string, we have the CSA for
$T_1, T_2, \dots, T_j$.

We will explain the idea of loading a string into the automaton. The states
of the automaton are divided into active and non-active. At the beginning,
before appending the first character of the string, only the initial state is
active. Once a state becomes active, it remains active until the whole string is
processed. We process each string character by character. When a character is
being loaded, we examine the transitions from all active states labeled with
this character. Let q be a state where one or more such transitions end. Two
situations can occur: (1) all the transitions that lead to q start in active
states, and then no splitting of q is needed, or (2) at least one input
transition of q starts in a non-active state, and then q must be split into two
states. In this case, we create a new state with the same output transitions,
and redirect all transitions leading to q and starting in an active state to the
copy. The new active states are found as targets of transitions that start in an
active state and are labeled with the current character. When the whole string
is loaded, all states that are not active are removed.

In the general case, the automaton built by the presented algorithm is not
minimal and can be minimized using a standard approach. The time complexity of
loading a string depends on the number of states of the automaton. We have
not found any tight upper bound for the number of states. Provided that
$n_1 = n_2 = \dots = n_k = n$, the trivial upper bound is $1 + n^k$. If the
number of states of the CSA after loading string T is $O(v)$ and we implement
the set operations in logarithmic time, loading a string T requires
$O(v \sigma \log v)$ time.



5 Conclusion 

We have described the on-line algorithm building the CSA for a set of strings. We 
have also proven that the maximum number of states of the CSA for two strings 
is at least quadratic in the length of the input strings. However, the problem of 
tight upper bound for the number of states remains open. 



Acknowledgment. Leszek Gąsieniec deserves many thanks for giving me the
idea to deal with this automaton.
Abstract. We work in the domain of a regional least-cost strategy with
dynamic validation in order to avoid cascaded errors [3], extending the
theoretical model to illustrate its asymptotic equivalence with global 
repair algorithms. This is an objective criterion to measure the quality of 
an error repair algorithm, since the point of reference is a technique that 
guarantees the best quality for a given error metric when all contextual 
information is available. To the best of our knowledge, it is the first 
time that such a discussion takes place. We also reformulate the parsing 
framework using parsing schemata [1], simplifying the description.



1 The Parsing Model 

Our aim is to parse a sentence $w_{1 \dots n} = w_1 \dots w_n$ according to an
unrestricted context-free grammar $\mathcal{G} = (N, \Sigma, P, S)$, where the
empty string is represented by $\varepsilon$. We generate from $\mathcal{G}$ a
push-down automaton (PDA) for the language $\mathcal{L}(\mathcal{G})$. In
practice, we chose an LALR(1) device generated by Ice [2], although any
shift-reduce strategy is adequate. A PDA is a 7-tuple
$\mathcal{A} = (Q, \Sigma, \Delta, \delta, q_0, Z_0, Q_f)$ where: $Q$ is the set
of states, $\Sigma$ the set of input symbols, $\Delta$ the set of stack
symbols, $q_0$ the initial state, $Z_0$ the initial stack symbol, $Q_f$ the set
of final states, and $\delta$ a finite set of transitions of the form
$\delta(p, X, a) \ni (q, Y)$ with $p, q \in Q$, $a \in \Sigma \cup
\{\varepsilon\}$ and $X, Y \in \Delta \cup \{\varepsilon\}$.

To get polynomial complexity, we avoid duplicating stack contents when
ambiguity arises, storing them in a table $\mathcal{I}$ of items,
$\mathcal{I} = \{[q, X, i, j],\ q \in Q,\ X \in \{\varepsilon\} \cup
\{\nabla_{r,s}\},\ 0 \le i \le j\}$, where $q$ is the current state, $X$ is the
top of the stack, and the positions $i$ and $j$ indicate the substring
$w_{i+1} \dots w_j$ spanned by the last category pushed onto the stack. The
symbol $\nabla_{r,s}$ indicates that the part $A_{r,s+1} \dots A_{r,n_r}$ of a
rule $A_{r,0} \to A_{r,1} \dots A_{r,n_r}$ has been recognized.

We describe the parser using parsing schemata [1]. A parsing schema is a
triple $\langle \mathcal{I}, \mathcal{H}, \mathcal{D} \rangle$, with
$\mathcal{H} = \{[a, i, i+1],\ a = w_{i+1}\}$ an initial set of items called
hypotheses that encodes the sentence to be parsed¹, and $\mathcal{D}$ a set of
deduction

* Research partially supported by the Spanish Government under projects TIC2000- 
0370-C02-01 and HP2001-0044, and the Autonomous Government of Galicia under 
project PGIDT01PXI10506PN. 

¹ The empty string, $\varepsilon$, is represented by the empty set of hypotheses, $\emptyset$. An input string
$w_{1 \dots n}$, $n \ge 1$ is represented by $\{[w_1, 0, 1], [w_2, 1, 2], \dots, [w_n, n - 1, n]\}$.

J.-M. Champarnaud and D. Maurel (Eds.): CIAA 2002, LNCS 2608, pp. 276-281, 2003.

© Springer- Verlag Berlin Heidelberg 2003 



Searching for Asymptotic Error Repair 277 



steps that allow new items to be derived from already known items. Deduction
steps are of the form $\{\eta_1, \dots, \eta_k \vdash \xi \ /\ conds\}$, meaning
that if all antecedents $\eta_i$ are present and the conditions $conds$ are
satisfied, then the consequent $\xi$ should be generated. In our case,
$\mathcal{D} = \mathcal{D}^{Init} \cup \mathcal{D}^{Shift} \cup
\mathcal{D}^{Sel} \cup \mathcal{D}^{Red} \cup \mathcal{D}^{Head}$, where:



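$\mathcal{D}^{Shift} = \{[q, X, i, j] \vdash [q', \varepsilon, j, j+1] \ /\ \exists [a, j, j+1] \in \mathcal{H},\ shift_{q'} \in action(q, a)\}$

$\mathcal{D}^{Sel} = \{[q, \varepsilon, i, j] \vdash [q, \nabla_{r,n_r}, j, j] \ /\ \exists [a, j, j+1] \in \mathcal{H},\ reduce_r \in action(q, a)\}$

$\mathcal{D}^{Red} = \{[q, \nabla_{r,s}, k, j][q', \varepsilon, i, k] \vdash [q', \nabla_{r,s-1}, i, j] \ /\ q' \in reveal(q)\}$

$\mathcal{D}^{Init} = \{\vdash [q_0, \varepsilon, 0, 0]\}$

$\mathcal{D}^{Head} = \{[q, \nabla_{r,0}, i, j] \vdash [q', \varepsilon, i, j] \ /\ q' \in goto(q, A_{r,0})\}$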



with $q_0 \in Q$ the initial state, and $action$ and $goto$ entries in the PDA
tables. We say that $q' \in reveal(q)$ iff $\exists Y \in N \cup \Sigma$ such
that $shift_q \in action(q', Y)$ or $q \in goto(q', Y)$, that is, when there
exists a transition from $q'$ to $q$ in $\mathcal{A}$. A deduction step Init is
in charge of starting the parsing process. The step Shift corresponds to pushing
a terminal $a$ onto the top of the stack when the action to be performed is a
shift to state $q'$. A step Sel corresponds to pushing the $\nabla_{r,n_r}$
symbol onto the top of the stack in order to start the reduction of a rule
$r$. The reduction of a rule of length $n_r > 0$ is performed by a set of $n_r$
steps Red, each of them corresponding to a pop transition replacing the two
elements $\nabla_{r,s} X_{r,s}$ placed on the top of the stack by the element
$\nabla_{r,s-1}$. The reduction of a rule $r$ is finished by a step Head
corresponding to a swap transition that recognizes the top element
$\nabla_{r,0}$ as equivalent to the left-hand side $A_{r,0}$ of that rule, and
performs the corresponding change of state. The parse attains a worst case time
(resp. space) complexity $O(n^3)$ (resp. $O(n^2)$). The input string has been
recognized iff the final item $[q_f, \nabla_{0,0}, 0, n]$, $q_f \in Q_f$ has
been generated.



2 The Error Repair Algorithm 



We first assume that we are dealing with the first error detected, using the
terminology introduced in [3]. We extend the item structure with the accumulated
error counter $e$, resulting in items $[p, X, i, j, e]$. Once the detection
items have been fixed, we apply the set of deduction steps in error mode,
$\mathcal{D}_{error}$, that follows:



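$\mathcal{D}^{Shift}_{error} = \{[q, X, i, j, 0] \vdash [q', \varepsilon, j, j+1, 0] \ /\ \exists [a, j, j+1] \in \mathcal{H},\ shift_{q'} \in action(q, a)\}$

$\mathcal{D}^{Insert}_{error} = \{[q, \varepsilon, i, j, 0] \vdash [q, \varepsilon, j, j, I(a)] \ /\ \nexists\, shift_{q'} \in action(q, a)\}$

$\mathcal{D}^{Delete}_{error} = \{[q, \varepsilon, i, j, 0] \vdash [q, \varepsilon, j, j+1, D(w_i)]\}$

$\mathcal{D}^{Replace}_{error} = \{[q, \varepsilon, i, j, 0] \vdash [q, \varepsilon, j, j+1, R(a)] \ /\ shift_{q'} \in action(q, a)\}$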



This process continues until a repair covers both error and detection items. Once 
this has been performed on each detection item, we select the corresponding 
regional repairs and the parse goes back to standard mode. Error counters are 
summarized at the time of reductions by adding counters on popped items: 

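$\mathcal{D}^{Sel}_{error} = \{[q, \varepsilon, i, j, e] \vdash [q, \nabla_{r,n_r}, j, j, e] \ /\ reduce_r \in action(q, a)\}$

$\mathcal{D}^{Red}_{error} = \{[q, \nabla_{r,s}, k, j, e][q', \varepsilon, i, k, e'] \vdash [q', \nabla_{r,s-1}, i, j, e + e'] \ /\ q' \in reveal(q)\}$

$\mathcal{D}^{Head}_{error} = \{[q, \nabla_{r,0}, i, j, e] \vdash [q', \varepsilon, i, j, e] \ /\ q' \in goto(q, A_{r,0})\}$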
with $\mathcal{D}_{error}$ the union of the error-mode steps above. When the
current repair is not the first one, it can modify a previous repair in order to
avoid cascaded repairs, adding the cost of the new error hypotheses so as to
profit from the experience gained from previous ones.

3 Asymptotic Behavior 

We consider arithmetic expressions to illustrate this point. In the worst
case, when the error repair zone becomes the entire input string, performance
and cost are the same as for global error repair. We introduce two deterministic
grammars, $\mathcal{G}_L$ and $\mathcal{G}_R$, and a non-deterministic one,
$\mathcal{G}_N$:

$\mathcal{G}_L$: E → E + T | T,  T → (E) | number
$\mathcal{G}_R$: E → T + E | T,  T → (E) | number
$\mathcal{G}_N$: S → S + S | (S) | number

As $\mathcal{G}_N$ contains a rule "S → S + S", sentences of the form
$b_1 + b_2 + \dots + b_{\ell+1}$ have an exponential number of parses, which
allows us to evaluate strongly ambiguous contexts. In the deterministic case,
parses are built from the left-associative (resp. right-associative)
interpretation for $\mathcal{G}_L$ (resp. $\mathcal{G}_R$), in order to estimate
the impact of traversal orientation. Erroneous input strings are of the form
"$b_1 + \dots + b_{i-1} + (b_i + \dots + (b_{[n/3]} + b_{[n/3]+1}\, b_{[n/3]+2} + \dots + b_\ell\, b_{\ell+1} + b_{\ell+2} + \dots + b_n$",
where $i \in \{[n/3], \dots, 1\}$ and $\ell = 3[n/3] - 2i + 1$, with $[n/3]$
being the integer part of $n/3$. Given $i$, regional repairs are obtained by
replacing the tokens $b_{[n/3]+2}, b_{[n/3]+4}, \dots, b_{\ell+1}$ by closed
brackets to obtain
"$b_1 + \dots + b_{i-1} + (b_i + \dots + (b_{[n/3]} + b_{[n/3]+1}) + \dots + b_\ell) + b_{\ell+2} + \dots + b_n$".
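
The exponential ambiguity of the rule S → S + S on a sum of operands can be
made concrete by counting parse trees: a sum is split at one of its "+"
operators into two independently parsed sums. The sketch below is only an
illustration of that count (the Catalan numbers); the function name is
hypothetical and bracketed sub-expressions are not considered.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(operands):
    """Number of parse trees of b_1 + ... + b_operands under S -> S + S | number."""
    if operands == 1:
        return 1  # a single number has exactly one parse
    # Choose the top-level "+": `left` operands on its left, the rest on its right.
    return sum(count_parses(left) * count_parses(operands - left)
               for left in range(1, operands))
```

For 1 to 6 operands the counts are 1, 1, 2, 5, 14 and 42, which is why a
tabular parser that shares sub-results is needed to keep the cost polynomial.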

3.1 The Error Repair Region 

We focus on the evolution of this region in relation to the location of the point 
of error, in opposition to static strategies associated to global repair approaches. 

Location of Points of Detection. As is shown in the left-hand side of
Fig. 1, when we deal with a global approach all input positions are points of
detection. In the regional case, results depend on the grammar. So, although
the number of points of detection grows with $i$ because of the increase in
the number of points of error, this number is higher for $\mathcal{G}_N$ (resp.
$\mathcal{G}_R$). This is due to the right-associativity introduced by the rule
"S → S + S" (resp. "E → T + E"), which generates a reduction for each "+"
operator in the parsed prefix, illustrating the convergence of regional repairs
with global ones. The reason why results for $\mathcal{G}_N$ and $\mathcal{G}_R$
do not agree with results for the global case is that in regional repairs,
operators "+" are not points of detection, while this is possible in a global
one. The maximal number of points for $\mathcal{G}_N$ and $\mathcal{G}_R$,
corresponding to the maximum size of the repair region as is shown in the
right-hand side of Fig. 1, is approximately half of those related to the global
case.

Factors Determining the Size. We first focus on the case of $\mathcal{G}_R$
(resp. $\mathcal{G}_N$), profiting from the sequence of cascaded errors raised
by the repair process exemplified. When the algorithm detects the first point of
error at $b_{[n/3]+1}$, it takes $b_{[n/3]}$ as point of detection and proposes
as regional repair the replacement of $b_{[n/3]+2}$ by a closed bracket. Once
this has been done, the algorithm returns the control to the parse until a new
point of error is detected at $b_{[n/3]+3}$. In this case, "$(b_{[n/3]}$" is
taken as the point of detection, which implies that we have moved back to a
point previous to that proposed for the first error detected at $b_{[n/3]+1}$.
Fig. 1. Points of detection (resp. repair scope) vs. position of the point of error

More exactly, the algorithm asks whether the first regional repair applied was
not optimal, taking into account the information about the parsing process now
available. Perhaps the best solution for this first error would have been either
to delete $b_{[n/3]+2}$ or to insert "+" between $b_{[n/3]+1}$ and
$b_{[n/3]+2}$, which at that moment were not considered because the reductions
defining the scopes of these repairs were not minimal in relation to that of the
regional repair finally applied.



Fig. 2. An example of cascaded errors for $\mathcal{G}_R$ and $\mathcal{G}_N$
(legend: points of detection, scopes of regional repair; input fragment
"$\dots + (b_{[n/3]-1} + (b_{[n/3]} + b_{[n/3]+1}\, b_{[n/3]+2} + b_{[n/3]+3}\, b_{[n/3]+4} + b_{[n/3]+5}\, b_{[n/3]+6} + \dots$")

We then repeat the same steps as in the first case, proposing the regional
repair that replaces $b_{[n/3]+4}$ by a closed bracket, followed by a shift over
"+". So, the frontier of the new error repair region is
"$(b_{[n/3]-1} + (b_{[n/3]} + b_{[n/3]+1}) + b_{[n/3]+3})$", which includes the
scope of the previous regional repair, whose frontier was
"$(b_{[n/3]} + b_{[n/3]+1})$", as is shown in Fig. 2. The algorithm continues to
apply the previous process for all $i \in \{[n/3], \dots, 1\}$, until the size
of the repair region extends to the whole original input string, as is shown in
Fig. 1.

In the case of $\mathcal{G}_L$ the size of the repair region grows with the
position of the point of error, $b_l$, $l \in \{[n/3]+1, [n/3]+3, \dots,
3[n/3]-2i+1\}$. This behavior is also a consequence of the presence of cascaded
errors, as is shown in Fig. 3. In comparison with previous results, when the
algorithm detects the first point of error at $b_{[n/3]+1}$, it takes
$b_{[n/3]}$ as the point of detection and proposes as regional
Fig. 3. An example of cascaded errors for $\mathcal{G}_L$
(legend: points of detection, scopes of regional repair; input fragment
"$\dots + (b_{[n/3]-2} + (b_{[n/3]-1} + (b_{[n/3]} + b_{[n/3]+1}\, b_{[n/3]+2} + b_{[n/3]+3}\, b_{[n/3]+4} + b_{[n/3]+5}\, b_{[n/3]+6} + \dots$")

repair the replacement of $b_{[n/3]+2}$ by a closed bracket, as was the case for
$\mathcal{G}_R$ and $\mathcal{G}_N$. As for $\mathcal{G}_R$, the rule providing
the reduction is "F → (E)". However, in this case, this reduction does not
characterize a regional repair because it is followed by a chain of reductions
in $\mathcal{G}_L$ previous to the next shift action, and not by an immediate
shift action. These reductions are given by the rules "T → F" and
"E → E + T", and the frontier of the repair region associated to this first
error is "$b_{[n/3]-1} + (b_{[n/3]} + b_{[n/3]+1})$". Applying a similar
reasoning to the next errors in the input string, we conclude that the sizes of
the error repair regions are now larger, as is shown in the right-hand side of
Fig. 1, which also illustrates the asymptotic convergence with global repairs.
So, the repair region when the last point of error, $b_\ell$, is to the right of
the input, includes the total input string.

3.2 The Computational Cost 

Items are the basis for showing the computational behavior of our proposal. 
The cost of the algorithm is, in the worst case, given by the cost of global error 
repair approaches, due to asymptotic equivalence between regional and global 
repairs. Our aim is to focus on the dependency of grammar design. 

The Case of Global Repairs. The generation is illustrated in the left-hand
side of Fig. 4. In all cases the number of items generated remains constant
because it is only dependent on the length of the input string. These strategies
expend equal effort on all parts of the program, including areas without errors.
The situation of the curve for Qn is justified by subsumption phenomena
between items generated by the parse process. In effect, the compact representation
of Qn in relation to Qr and Ql in terms of the number of rules facilitates
the application of such mechanisms. The greater cost of Qr in relation to Ql is
due to the introduction of non-determinism by the error hypotheses. When a
token "$b_i$" is shifted in Ql, the only pda action available is the reduction of all
or part of the analyzed prefix, since we can assume that the lookahead is "+",
")" or H. For Qr, two possibilities exist. When the lookahead is a "+", a shift
takes place; but when it is a ")" or H, a reduction is made. Thus, in the case
of Qr, the error repair algorithm introduces a larger number of parse conflicts,
and hence items.

Fig. 4. Items for global (resp. regional) repairs vs. position of the point of error

The Case of Regional Repairs. The generation is discussed with reference
to the right-hand side of Fig. 4. The general distribution of curves for Ql, Qr
and Qn is the same as mentioned for global repairs and it can be justified in
the same manner. It is of interest to compare the results for global and regional
repairs. So, Fig. 5 shows the number of items whose generation has been saved
going from global to regional repair, illustrating the asymptotic convergence. The 
difference in terms of items generated is minor when the point of error is situated 
to the right of the input string, enlarging the repair region. This difference does 
not reach zero, which is in apparent contradiction with the above-mentioned 
convergence. We should take here into account that even though the size of the 
repair region can be the same for both global and regional repairs, the latter are 
not forced to apply the error hypotheses on all the error parse branches. 







Fig. 5. Saved items from global to regional repair vs. position of the point of error

Abstract. In this paper we discuss efficient symbolic representations for
infinite-state systems specified using linear arithmetic constraints. We
give new algorithms for constructing finite automata which represent integer
sets that satisfy linear constraints. These automata can represent
either signed or unsigned integers and have a lower number of states compared
to other similar approaches. We experimentally compare different
symbolic representations by using them to verify non-trivial specification
examples. In many cases symbolic representations based on our construction
algorithms outperform the polyhedral representation used in Omega
Library, or the automata representation used in LASH.



1 Introduction 

Symbolic representations enable verification of systems with large state spaces 
which cannot be analyzed using enumerative approaches. Recently, symbolic 
model checking has been applied to verification of infinite-state systems using 
symbolic representations that can encode infinite sets [8,5,7] . One class of infinite- 
state systems are systems that can be specified using linear arithmetic formulas 
on unbounded integer variables. Verification of such systems has many interesting
applications such as monitor specifications, mutual exclusion protocols [5,7],
and parameterized cache coherence protocols [6]. In this paper we present new 
symbolic representations for linear arithmetic formulas and experimental results 
on efficiency of different symbolic representations. 

There are two basic approaches to symbolic representation of linear arith- 
metic constraints in verification: 1) Polyhedral representation: In this approach 
linear arithmetic formulas are represented in a disjunctive form where each dis- 
junct corresponds to a convex polyhedron. Each polyhedron corresponds to a 
conjunction of linear constraints [8,7]. This approach can be extended to full 
Presburger arithmetic by including divisibility constraints (which can be repre- 
sented as an equality constraint with an existentially quantified variable) [5,2]. 
2) Automata representation: An arithmetic constraint on v integer variables can
be represented by a v-track automaton that accepts a string if it corresponds to
a v-dimensional integer vector (in binary representation) that satisfies the corresponding
arithmetic constraint [4,11]. For both of these symbolic representations
one can implement algorithms for intersection, union, complement, existential
quantifier elimination operations, and subsumption, emptiness and equivalence
tests, and therefore use them in model checking.

* This work is supported in part by NSF grant CCR-9970976 and NSF CAREER
award CCR-9984822.

In this paper we present new construction algorithms for the automata rep- 
resentation and also experimentally compare these different approaches. We give 
a new algorithm for constructing a finite automata representation for sets sat- 
isfying Presburger arithmetic formulas. The size of the resulting automaton in 
our construction has the same upper bound as the construction given in [4], 
however, our construction is also able to handle negative integers. The size of 
the resulting automaton in our construction is different than the construction 
given in [11]. We implemented our construction algorithm using the MONA tool 
[9] and integrated it to a set of tools for infinite-state model checking [12]. We 
experimented with a large set of examples. To compare the performance of our 
construction algorithm to other approaches we also integrated the LASH tool 
[1] which uses the automata construction given in [11], and Omega Library [2] 
which uses a polyhedral representation to the same set of tools and ran them on 
the same set of examples. Our experimental results show that our construction 
algorithm produces more compact representations than the construction algo- 
rithm given in [11]. Also automata representation is more efficient compared to 
the polyhedral representation used in [2]. 

2 Finite Automata Representation for Presburger 
Formulas 

In this section we give a brief description of our algorithm for constructing a 
finite automaton that accepts the set of natural number tuples that satisfy a 
Presburger arithmetic formula on v variables. A full description and analysis 
of the algorithm and a comparison to other approaches is given in [3]. We encode
numbers using their binary representation. A v-tuple of natural numbers
$(n_1, n_2, \ldots, n_v)$ is encoded as a word over the alphabet $\{0, 1\}^v$, where the $i$th letter
in the word is $(b_{i1}, b_{i2}, \ldots, b_{iv})$ and $b_{ij}$ is the $i$th least significant bit of number $n_j$.
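The encoding can be illustrated with a minimal Python sketch. The function names `encode` and `decode` are hypothetical and are not taken from the paper or from any of the tools it mentions; the sketch only assumes the convention stated above, namely that letters are bit vectors read least significant bit first.

```python
def encode(numbers, length=None):
    """Encode a tuple of natural numbers as a word over {0,1}^v.

    The i-th letter holds the i-th least significant bit of every number.
    `length` (an assumption of this sketch) pads the word with zero letters.
    """
    if any(n < 0 for n in numbers):
        raise ValueError("only natural numbers are supported")
    needed = max((n.bit_length() for n in numbers), default=0)
    if length is None:
        length = needed
    if length < needed:
        raise ValueError("length is too small for the given numbers")
    return [tuple((n >> i) & 1 for n in numbers) for i in range(length)]


def decode(word, v):
    """Inverse of encode: rebuild the v numbers from a list of letters."""
    numbers = [0] * v
    for i, letter in enumerate(word):
        for j, bit in enumerate(letter):
            numbers[j] |= bit << i
    return tuple(numbers)
```

For instance, the pair (5, 3) becomes a three-letter word whose first letter carries the lowest bits of both numbers.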

The construction relies on a basic state machine (BSM) that performs linear
arithmetic on non-negative integer variables. Each state of the BSM for
$\sum_{i=1}^{v} a_i x_i$ is associated with a carry value. At any point, the BSM adds up the
bit of the $i$th variable of the current symbol $a_i$ times for each $i$, plus the carry
value of the current state. It writes the resulting bit to the output and moves
to a new state according to the value of the new carry. The number of possible
values of the carry (and thus the number of states) is $\sum_{i=1}^{v} |a_i|$.

A finite automaton (FA) for $\sum_{i=1}^{v} a_i x_i = 0$ is similar to the BSM but has an
extra sink state. Whenever the resulting bit is 1, the FA moves to the sink state,
otherwise it continues as described before. The only initial and accepting state
is the one associated with 0 carry. A FA for $\sum_{i=1}^{v} a_i x_i < 0$ has no sink state.
The accepting states are those associated with negative carries. The construction
of FA for all other kinds of linear constraints ($\leq$, $>$, $\geq$) is similar. If the
right hand side of the equations or inequations is a non-zero constant $c$, the
state with carry value of $-c$ will be the initial state of the FA. If no such state
exists, we need to introduce more states corresponding to carry values between
$-c$ and the carry value closest to $-c$. Thus the number of states now becomes
at most: $S = |\min(-c, \sum_{a_i<0} a_i)| + |\max(-c, \sum_{a_i>0} a_i)|$. This is a tighter upper
bound than the one given in [4], even though the construction algorithms
are similar. An alternative way to cope with the constant term $c$ is to stack
$\log_2 c + 1$ BSMs, where the $i$th BSM compares the resulting bit against the $i$th
bit of $c$. Now the total number of states is at most $(\log_2 c + 1) \cdot \sum_{i=1}^{v} |a_i|$. We
can choose which alternative to use depending on the expected upper bound on
the number of states. Finally, after constructing FA for atomic linear constraints
(equations and inequations) we can construct FA for any Presburger formulas using
standard automata operations such as intersection, union, complementation
and projection.
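The carry-based behaviour described above can be sketched as a direct simulation rather than an explicit automaton. The following Python code is only an illustration under stated assumptions: the function name `run_constraint`, its parameters, and the restriction to the relations `==` and `<` are choices of this sketch, not the authors' implementation (which is built on the MONA package). It starts in the state with carry $-c$, treats an odd sum as the sink state for equations, and accepts on carry 0 (equations) or on a negative carry (strict inequations).

```python
def run_constraint(coeffs, c, values, relation="=="):
    """Check sum(a_i * x_i) `relation` c on natural numbers x_i by reading
    their binary encodings least significant bit first, as a carry automaton."""
    if relation not in ("==", "<"):
        raise ValueError("this sketch only handles '==' and '<'")
    length = max((x.bit_length() for x in values), default=0)
    carry = -c                               # initial state: carry value -c
    for i in range(length):
        # add the i-th bit of every variable a_i times, plus the carry
        total = carry + sum(a * ((x >> i) & 1) for a, x in zip(coeffs, values))
        if relation == "==" and total % 2 == 1:
            return False                     # resulting bit is 1: sink state
        carry = total // 2                   # floor division gives the new carry
    # equations accept on carry 0, strict inequations on negative carries
    return carry == 0 if relation == "==" else carry < 0
```

Usage: `run_constraint((1, 1, -1), 0, (3, 5, 8))` checks x + y - z = 0 for x = 3, y = 5, z = 8.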

Based on the same ideas we can construct FA for formulas on all integers 
(including negative), using 2’s complement arithmetic. The procedure is based on 
the fact that in order for the FA to accept the encoding of a tuple of numbers, 
it must also accept the encoding of the same numbers with arbitrarily many 
sign bits (i.e. the most significant bit of each number repeated arbitrarily many 
times). The FA contains two clones of each state of the BSM, one accepting 
and one rejecting. For equations, looping transitions in the BSM that write 0
go to the corresponding accepting clone. All other transitions that write 0 go to the
rejecting clone. All transitions that write 1 in the BSM go to the sink state. 
For inequations, looping transitions that write 1 go to the accepting clone and 
those which write 0 go to the rejecting clone. Any other transition goes to the 
appropriate accepting clone, iff by repeatedly receiving the same combination of 
bits the BSM will eventually enter a loop which writes 1. Otherwise it goes to 
the rejecting clone. These FA are only twice as large as those for non-negative 
integers described before. 

Note that our construction algorithm and the one in [11] result in different 
automata, because the accepted languages are different (one is the reverse of the 
other). Moreover, we can prove the same or lower upper bound on the size of 
the resulting automata. Finally, in our algorithm, once a state has been created, 
all transitions originating from it can be computed immediately (as opposed to 
[11]), which is more convenient when transitions are stored using BDDs. 



3 Implementation and Experiments 

In [10] polyhedral and automata representation for arithmetic constraints are 
compared experimentally for reachability analysis of several concurrent systems. 
The results show no clear winner. On some problem instances the polyhedral 
representation is superior, on some others the automata representation is. Our experimental
setup is more reliable compared to [10]. In [10] boolean variables
are mapped to integer variables when polyhedral representation is used. This is 
an inefficient encoding which gives an unfair advantage to the automata repre- 
sentation. In our experiments boolean variables are not mapped to integers in 
any representation. Also, our tools perform full CTL model checking including 
liveness properties instead of just reachability analysis discussed in [10]. 

We integrated our construction algorithms to an infinite state CTL model 
checker built on top of the Composite Symbolic Library [12]. The Composite 
Symbolic Library defines an abstract interface for the operations used in sym- 
bolic verification [12]. To integrate a new symbolic representation to the Com- 
posite Symbolic Library one implements this abstract interface with specialized 
operations. Composite Symbolic Library supports a disjunctive composite representation
for formulas on integer and boolean variables. A disjunctive composite
representation is in the form $\bigvee_{i=1}^{n} \bigwedge_{t \in T} p_{it}$ where $p_{it}$ denotes the formula of type
$t$ (which could be integer or boolean) in the $i$th disjunct, and $n$ and $T$ denote
the number of disjuncts and the set of variable types ($T = \{integer, boolean\}$),
respectively. The methods such as intersection, union, complement, satisfiability
check, subsumption test, which manipulate composite representations in the
above form are implemented in the Composite Symbolic Library by calling the
operations on integer and boolean formula representations [12].
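To make the shape of a disjunctive composite representation concrete, here is a small Python sketch. It does not reproduce the Composite Symbolic Library interface: the functions `intersect`, `union` and `is_satisfiable`, and the use of finite `frozenset` values as stand-ins for BDDs, polyhedra or automata, are assumptions made purely for illustration. Each disjunct maps a type name to a formula of that type, and the composite operations delegate to the per-type operations.

```python
from itertools import product


def intersect(f, g, conj):
    """Intersection of two composite formulas (lists of disjuncts).
    conj[t] is the conjunction operation for formulas of type t."""
    return [{t: conj[t](d1[t], d2[t]) for t in conj}
            for d1, d2 in product(f, g)]


def union(f, g):
    """Union simply collects the disjuncts of both formulas."""
    return list(f) + list(g)


def is_satisfiable(f, sat):
    """Satisfiable if some disjunct has every typed component satisfiable."""
    return any(all(sat[t](d[t]) for t in sat) for d in f)


# Illustrative stand-ins: a formula is the finite set of values satisfying it.
CONJ = {"integer": frozenset.intersection, "boolean": frozenset.intersection}
SAT = {"integer": bool, "boolean": bool}
```

The pairwise product in `intersect` mirrors distributing a conjunction over two disjunctions; a real implementation would typically also simplify away unsatisfiable disjuncts.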

We integrated five different symbolic representations to the Composite Symbolic
Library. The first three use the disjunctive composite representation
described above to combine formulas on integer and boolean variables. We used
the BDD representation for boolean formulas. We implemented three different
integer formula representations using LASH [1] (version V3), Omega [2] (version
V2), and our automata construction algorithm (version V1) which uses the MONA
automata package [9] as an automata manipulator. We also implemented two
automata-based representations using LASH (version V5) and our construction
algorithms (version V4), again built on top of the MONA automata package, for both
boolean and integer variables without using the disjunctive composite representation.
The states of both boolean and integer variables can be represented in an
automaton, hence one can avoid using the disjunctive composite representation.

We experimented with a large set of examples. Specification of the examples 
and properties are available at: http://www.cs.ucsb.edu/~bultan/composite/. 

The results of our experimental evaluation of different representations for linear
integer arithmetic constraints are shown in Table 1. We obtained the experimental
results on a SUN ULTRA 10 workstation with 768 Mbytes of memory,
running SunOS 5.7. For each version of the verifier we recorded the following
statistics: 1) Time elapsed during the construction of the symbolic representation
of the transition system, shown in the table as CT. 2) Time elapsed during
the verification process, shown as VT. It includes the time needed for forward
or backward fixpoint computations, however, it excludes the construction time
(CT). 3) The maximum amount of memory used by the verifier, shown as Mem.
Also for V1, V3, V4 and V5 that use automata as a symbolic representation we
recorded the size (number of states) of the automaton representing the transition
system, shown as TRS, and the size of the largest automaton computed during
the fixpoint computation, shown as MS. As discussed above, our automata construction
algorithm used in versions V1 and V4 uses the MONA automata package.
The MONA automata package uses BDDs to store the transition relation of the automata.
Therefore, to make the comparison with LASH fair, instead of giving
the number of automaton states for versions V1 and V4, we give the total number
of BDD nodes used in the MONA representation. For the barber and ticket
problems we used forward fixpoint computation, while for all other problems
backward fixpoint computations were used.

By carefully inspecting Table 1 one can make the following remarks. Versions
V4 and V5 that use only automata as a single representation for both integer and
boolean variables require less memory, both for the transition relation and the
fixpoint iterates, than V1, V2 and V3 that use disjunctive representations. This
can be explained by the fact that deterministic minimal automata are canonical,
while the composite symbolic representation is not. For most of the examples, but
not all, V4 performs better than V5 both in construction and verification times
and memory requirements. This is in accordance with the measured sizes of the
two automata implementations. On the other hand, by inspecting the data for
V1, V2, V3, we can see a clear manifestation of a time-memory tradeoff. For most
examples V1 is faster than V2 and V3, but uses more memory. V3 consumes
less memory, but is usually slower than the other two. V2 appears to be a good
compromise between time and memory efficiency. Now, when comparing the
composite versions V1 and V3 with their "pure automata" counterparts V4 and
V5 respectively, we can conclude that the latter generally perform better than
the former, with the exception of most construction time measurements for V5
being greater than those for V3.

Based on these results we conclude that the versions based on our automata
construction algorithms for linear integer arithmetic formulas and implemented
using the MONA automata package (V1 and V4) are the most time efficient of all,
with V4 being more memory efficient than V1. If a composite representation
must be used, the Omega Library based V2 can be a good time-memory compromise.
Finally, the LASH based V5 is a close competitor to V4, but suffers from high
construction times.

Abstract. There are several data structures that allow searching for
a pattern P in a preprocessed text T in time dependent just on the
length of P. In this paper we present an implementation of CDAWG’s
(Compact Directed Acyclic Word Graphs). While the previous implementations
of CDAWG’s required from 7n to 23n bytes of memory space, ours
achieves 1.7n to 5n for a text T of length n. The implementation is suitable
for large data files, since it minimizes the number of disk accesses. If
disk accesses are not to be optimized, space requirements can be further
decreased.



1 Introduction 

Finding a pattern $P = p_1 p_2 \cdots p_m$ (of length $m$) in a text $T = t_1 t_2 \cdots t_n$ (of
length $n$) is a very important task for any data retrieval system. There is a huge
amount of unstructured text in which one needs to find information, and other
types of texts appear at a steady rate. If we know the text T in advance, we
can preprocess it and build a complete index over T. In this way we improve
searching time to $O(m)$ (on a fixed alphabet, which is the standard practical
situation). Several data structures have been developed for that purpose (see [7] ,
uni or 0). Kurtz m presents implementation experiments on several of them
(DAWG: Directed Acyclic Word Graph, CDAWG: Compact DAWG, suffix tree).
In this paper we discuss CDAWG implementation.

A CDAWG may be regarded as the compaction of a DAWG. A DAWG 0
1^ is a structure that can answer a query in time $O(m)$. The size of DAWG’s
is linear in n, as is their construction time. In order to decrease the
number of states of DAWG’s, a data structure called Compact DAWG (CDAWG)
was developed [3,4]. This CDAWG was built by compaction of a DAWG. Direct
constructions, avoiding a preliminary DAWG construction, have been designed in
[9] and [11].

In the present paper we describe our implementation of CDAWG and compare
it with other similar structures allowing fast pattern searching. While previous
implementations of CDAWG’s required from 7n to 23n bytes of memory
space, we show that ours achieves 1.7n to 5n bytes for a text T of length n. This
shows that the implementation is suitable for large data files, since it minimizes
the number of disk accesses.

2 CDAWG 

One way to develop a complete index over a text T is to construct its suffix trie. 
We can build the suffix trie by merging finite automata of suffixes of T (each 
accepting a suffix) giving to all these automata the same initial state. 
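As a rough illustration of the suffix trie as a complete index, the following Python sketch inserts every suffix of T from a shared initial state and then answers a query by following one transition per pattern symbol. The names `build_suffix_trie` and `contains` are hypothetical, and nested dictionaries are only a convenient stand-in for states; the trie can grow quadratically in the text length, which is precisely why the compacted and minimized variants discussed next are of interest.

```python
def build_suffix_trie(text):
    """Merge the automata of all suffixes of `text` into one trie.
    Each state is a dict mapping a symbol to the next state."""
    root = {}
    for start in range(len(text)):
        node = root                      # every suffix starts in the initial state
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` is a factor of the indexed text.
    The loop performs at most len(pattern) transitions."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return False
        node = node[symbol]
    return True
```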

If we replace each sequence of transitions with single-symbol labels that do not
fork by one transition with a string label (an operation called compaction), we get a
suffix tree. If we minimize the suffix trie, we get a DAWG. Finally, if we minimize a
suffix tree, we get a CDAWG. If we compact the DAWG, we get the same CDAWG.
This approach is used in m-

Another way is to construct a simple non-deterministic finite automaton
(NFA) accepting T. Then we connect the initial state with all the other states
using ε-transitions and transform this NFA to the equivalent deterministic finite
automaton. Thus we get a DAWG that can be further compacted to get a
CDAWG.
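The NFA-based route can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: NFA states are the positions 0..n of the text, the ε-transitions from the initial state are modelled by starting with all positions active, and the function names are hypothetical. A deterministic state is the set of NFA positions active after reading a string; it is treated as final when it contains position n, so that exactly the suffixes of T are accepted.

```python
def determinize_text_nfa(text):
    """Subset construction for the NFA 0 -t1-> 1 -t2-> ... -tn-> n whose
    initial state is linked to every other state by epsilon-transitions."""
    n = len(text)
    start = frozenset(range(n + 1))      # epsilon-closure of the initial state
    transitions = {}
    todo = [start]
    while todo:
        state = todo.pop()
        if state in transitions:
            continue
        transitions[state] = {}
        for symbol in set(text):
            target = frozenset(p + 1 for p in state
                               if p < n and text[p] == symbol)
            if target:
                transitions[state][symbol] = target
                todo.append(target)
    return start, transitions


def walk(start, transitions, pattern):
    """Follow `pattern` from the start state; return the reached state or None."""
    state = start
    for symbol in pattern:
        state = transitions[state].get(symbol)
        if state is None:
            return None
    return state


def is_factor(start, transitions, pattern):
    return walk(start, transitions, pattern) is not None


def is_suffix(start, transitions, pattern, n):
    state = walk(start, transitions, pattern)
    return state is not None and n in state
```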

An algorithm for direct construction of a CDAWG without constructing a
DAWG first was introduced in [S] while an algorithm for on-line construction of a
CDAWG was introduced in m

3 Implementation 

The algorithm at the origin of our implementation of CDAWG’s proceeds in
two steps: first, we construct a CDAWG as described in jS] and we sort states
according to their topological order; then, we classify the states into three classes
according to their maximum length of incoming transition label (maxLen) and
we add a fourth class for the terminal state $q_t$, from which no transition leads:

– Class I ($Q_I$): the initial state $q_0$ and the states with maxLen = 1,

– Class II ($Q_{II}$): the states with 1 < maxLen ≤ Limit,

– Class III ($Q_{III}$): the states with maxLen > Limit,

– Class IV ($Q_{IV}$): the terminal state $q_t$ (no transition leads from $q_t$).
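The four-way classification above can be written as a small Python function. This is a sketch, not the authors' code: the name `classify_state` and its parameters are hypothetical, and it assumes that a state whose maxLen equals Limit belongs to Class II (the boundary case has to fall into one of the two classes).

```python
def classify_state(max_len, limit, is_initial=False, is_terminal=False):
    """Return the class ("I".."IV") of a CDAWG state.

    max_len is the maximum length of an incoming transition label.
    Assumption of this sketch: max_len == limit is treated as Class II.
    """
    if is_terminal:
        return "IV"                 # terminal state: no outgoing transitions
    if is_initial or max_len == 1:
        return "I"                  # initial state and single-symbol labels
    if max_len <= limit:
        return "II"                 # label short enough to be stored whole
    return "III"                    # long label: store a pointer into the text
```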

The typical distribution of states according to maxLen is shown in Table 1. As
we can see, most of the labels are very short. That is the reason why we have
decided to create Classes. The most frequent Class I occupies the smallest part
of memory. For each state with maxLen > 1, a parameter Limit distinguishes
which implementation (either Class II or III) is more space-efficient.

Table 1. Distribution of states according to their maxLen in Calgary Corpus file
paper4. The maximum maxLen for file paper4 is 28.

maxLen            1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21
no. of states  2546  464  231  142   86   57   40   27   25   16    5    3   11    6    3    1    2    0    2    1    0

For states Qi we store just outgoing transitions. For states Qii and Qiii we 
have to store also incoming transition label. While in Class II we store the whole 
label (the longest one), in Class III we store only a pointer to the source text 
(EndPos). The terminal state (the state in Qiv) is not memorized. 

We can also store some optional additional data. In Classes I, II, and III 
we can store one bit indicating, whether the corresponding state is final or not. 
Doing so, the CDAWG recognizes all suffixes of T, otherwise it recognizes all 
factors of T considering that all states are final. In Classes I and II we can also 
memorize the corresponding position in text (EndPos). The terminal state is 
always final and has EndPos = n. 

Then we store outgoing transitions. We store the number of outgoing transitions
and the first symbol of each of the outgoing transition labels (string of first
symbols, SFS). Then we store the outgoing transition records.

At the beginning of each outgoing transition we store the Class number of
the destination state. For Classes I, II, and III we store a pointer to the beginning
of the destination state (number of bits to be skipped to reach the destination
state). For Classes II, III, and IV we also store the length of the outgoing transition
label (Len − 1).

We do not store the first symbol of the label since it is already stored in the
SFS. Removing the SFS and storing entire label strings in destination states
could decrease space requirements, but searching for the desired transition
would then force reading from as many parts of the CDAWG data file as
there are outgoing transitions. This is likely to require more
disk accesses and would significantly increase the resulting searching time.
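The benefit of keeping the string of first symbols with the state can be seen in a simplified in-memory model. The Python sketch below is not the bit-level layout used by the implementation: the classes `Transition` and `State`, their field names, and the function `find_transition` are hypothetical. It only shows that the transition for a given symbol is selected by inspecting the state record itself, without touching any destination state.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    dest_class: str     # class of the destination state ("I" to "IV")
    dest: int           # hypothetical offset of the destination state
    rest_len: int       # Len - 1: label length without its first symbol


@dataclass
class State:
    first_symbols: str = ""                        # string of first symbols
    transitions: list = field(default_factory=list)

    def add(self, symbol, transition):
        # keep first_symbols and transitions parallel: same index, same edge
        self.first_symbols += symbol
        self.transitions.append(transition)


def find_transition(state, symbol):
    """Select the outgoing transition whose label starts with `symbol`,
    reading only the state record (no destination state is visited)."""
    i = state.first_symbols.find(symbol)
    return state.transitions[i] if i >= 0 else None
```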

4 Results 

We made some experiments on Calgary and Canterbury Corpora test files¹ [1].
The comparison with other methods is shown in Table 2. Values for DAWG,
CDAWG, and Suffix Tree are taken from [T^; DAWG.B is the implementation
of DAWG’s designed by M. Balik [2]; finally, CD.HC1 is our implementation. If
we do not care for the number of disk accesses, we can move symbols from the
SFS to the destination state. The space requirements further decrease, as we can
see in column CD.HC2. The best results are highlighted.

A single character in the table denotes the type of file: e for English text, f
for formal text (like programs), b for binary files (i.e. containing 8-bit symbols),
and d for DNA sequences.

¹ File ptt5 from Canterbury Corpus and pic from Calgary Corpus are the same.






J. Holub and M. Crochemore 



Table 2. Space requirements for suffix data structures applied to files of the Calgary
and Canterbury Corpora (values are in bytes per symbol of text).

file      type  |Σ|   length   DAWG  CDAWG  Suff.Tree  DAWG.B  CD.HC1  CD.HC2
book1     e      81   768771  30.35  15.75       9.83    3.66    4.42    3.78
book2     e      96   610856  29.78  12.71       9.67    3.17    3.67    3.14
paper1    e      95    53161  30.02  12.72       9.82    2.98    3.26    2.72
paper2    e      91    82199  29.85  13.68       9.82    3.06    3.58    3.01
paper3    e      84    46526  30.00  14.40       9.80    3.12    3.62    3.02
paper4    e      80    13286  30.34  14.76       9.91    3.04    3.46    2.82
paper5    e      91    11954  30.00  14.04       9.80    2.97    3.34    2.72
paper6    e      93    38105  30.29  12.80       9.89    2.96    3.27    2.73
alice29   e      74   152089  30.27  14.14       9.84    3.20    3.82    3.23
lcet10    e      84   426754  29.75  12.70       9.66    3.12    3.56    3.03
plrabn12  e      81   481861  29.98  15.13       9.74    3.52    4.15    3.53
bible     e      64  4047392  29.28  10.87       7.27    2.94    3.26    2.88
world192  e      94  2473400  27.98   7.87       9.22    2.53    2.43    2.09
bib       f      81   111261  28.53   9.94       9.46    2.68    2.68    2.24
news      f      98   377109  29.48  12.10       9.54    3.15    3.44    2.91
progc     f      92    39611  29.73  11.87       9.59    2.87    3.06    2.54
progl     f      87    71646  29.96   8.71      10.22    2.40    2.39    2.03
progp     f      89    49379  30.21   8.28      10.31    2.35    2.28    1.92
trans     f      99    93695  30.47   6.69      10.49    2.35    1.95    1.66
fields.c  f      90    11150  29.86   9.40       9.78    2.43    2.39    1.96
cp.html   f      86    24603  29.04  10.44       9.34    2.64    2.58    2.12
grammar   f      76     3721  29.96  10.60      10.14    2.36    2.44    1.97
xargs     f      74     4227  30.02  13.10       9.63    2.75    2.99    2.40
asyoulik  f      68   125179  29.97  14.93       9.77    3.34    3.84    3.23
geo       b     256   102400  26.97  13.10       7.49    3.18    2.66    1.92
obj1      b     256    21504  27.51  13.20       7.69    2.98    2.39    1.67
obj2      b     256   246814  27.22   8.66       9.30    2.67    1.96    1.51
pic       b     159   513216  27.86   8.08       8.94    1.63    2.17    1.79
kennedy   b     256  1029744  21.18   7.29       4.64    1.57    1.65    1.10
sum       b     255    38240  27.79  10.26       8.92    2.53    2.86    2.29
E.coli    d       4  4638690  34.01  23.55      12.56    4.46    5.46    5.24

We also made experiments on several other DNA sequences (of size n = 40 kB
to 1.5 MB) and the space requirements were about 4.5n (CD.HC1: 4.21-5.15,
CD.HC2: 3.99-4.92).

Further decrease of space requirements can be achieved if we store states at
even addresses or addresses divisible by 4 or 8.

In addition to the difference between DAWG and CDAWG structures, we can
discuss the difference between the implementations DAWG.B and CD.HC. DAWG.B
was developed for minimum space requirements, therefore it uses some compression
techniques (for alphabet symbols, number of edges, edges, etc.). CD.HC
was developed for speed, therefore it uses the SFS and no compression.
It also needs the source text to be stored outside the data structure.

5 Conclusion

In the paper we presented an efficient implementation of CDAWG’s. The space
requirements are much lower than in the previous works and vary from 1.65 to
5.46 bytes per symbol of input text. The implementation requires that the source
text is stored, so the total space increases by one byte per character, but other
implementations also need the source text. Only [2] can reconstruct the source
text from its implementation of DAWG’s, but it takes some time.

The space requirements can be further decreased. For instance, we can compress
the labels or we can remove the SFS and store all incoming transition labels
in the destination states (CD.HC2). But as mentioned above, this would increase
the number of disk accesses. In our implementation we require at most $m + 1$ disk
accesses when searching for a pattern of length $m$: we traverse at most $m + 1$
states (including the initial state) and all outgoing transitions are located with
the state. But in the case of CD.HC2, we would get $(mr + 1)$ disk accesses, where $r$
is the average number of outgoing transitions. In such a case we need, for each
outgoing transition, to look at the destination state to find out the transition
label.
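The two estimates can be compared with a tiny worked example. The Python function below is only an illustration of the arithmetic stated above; the name `disk_accesses` and the sample values m = 10 and r = 3 are assumptions of this sketch, not measurements.

```python
def disk_accesses(m, r=None):
    """Worst-case number of disk accesses for a pattern of length m.

    r is None: all outgoing transitions are stored with the state (m + 1).
    r given:   labels live in the destination states, so each of the
               r outgoing transitions (on average) must be visited (m*r + 1).
    """
    return m + 1 if r is None else m * r + 1
```

With m = 10 and an average of r = 3 outgoing transitions, the first layout needs at most 11 accesses while the second needs about 31.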

The goal of this work is to implement a CDAWG so that it can be used for
large source texts. In such a case we require minimum disk accesses, which are in
the worst case (the required data are not in the disk cache) 100,000 times slower
than memory accesses. When running the CDAWG, we traverse the resulting data
file forward (we do not go backward). We can also set any size of the CDAWG
buffer (the part of the CDAWG stored in main memory) and thus control space
requirements when running CDAWG.

Abstract. A nondeterministic finite automaton (NFA) cannot be used directly
because of its nondeterminism. One way to use an NFA is to determinize
it; the second way is to use one of the simulation methods. This
paper deals with the simulation method called dynamic programming.
We present the method on one of the pattern matching problems
as well as modifications for several other tasks.



1 Introduction 

In Computer Science there is a class of problems that can be solved by finite
automata. For some of these problems one can directly construct a deterministic
finite automaton (DFA) that solves them. For other problems it is easier to build
a nondeterministic finite automaton (NFA). Since one cannot use an NFA directly
because of its nondeterminism, one should transform it to the equivalent DFA
using the standard subset construction PTTI] or one should simulate a run of the
NFA using one of the simulation methods |B].

When transforming an NFA, one can get a DFA with a huge number of states
(up to $2^{|Q_{NFA}|}$, where $|Q_{NFA}|$ is the number of states of the NFA). The time complexity
of the transformation is proportional to the number of states of the DFA. The
run is then very fast (linear in the length of the input text). On the other hand,
when simulating the run of the NFA, the time and space complexities are given by
the number of states of the NFA. The run of the simulation is then slower.

Three simulation methods are known |B]: the basic simulation method, dynamic
programming, and bit parallelism. All of these methods use breadth-first
search for traversing the state space. The first overview of the simulation
methods was presented in |B]. This paper is a continuation of [Zj, where the
basic simulation method and bit parallelism were described in detail.

Dynamic programming was first used for approximate string matching using the
Levenshtein distance in m . It just computed the edit distance between
the pattern and the text. In P!. each configuration of the integer vector of
the dynamic programming was considered as a state of a DFA and in such a way the

* This research has been partially supported by MSMT research program No 
MSM 212300014 and by GACR research programs No GP201/01/P082 and 
GA201/01/1433. 



J.-M. Champarnaud and D. Maurel (Eds.): CIAA 2002, LNCS 2608, pp. 295-300, 2003.
DFA was constructed (the size of the DFA is $\min(3^m, (k+1)!(k+2)^{m-2})$ [13]
E], where $m$ is the length of the pattern and $k$ is the maximum number of
differences allowed). Then in | 11I3I4| it was shown that the dynamic
programming is a simulation of an NFA.

2 Definitions 

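Let $\Sigma$ be a nonempty input alphabet, $\Sigma^*$ be the set of all strings
over $\Sigma$, $\varepsilon$ be the empty string, and
$\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$. If $a \in \Sigma$, then
$\bar{a} = \Sigma \setminus \{a\}$ denotes the complement of $a$ over $\Sigma$.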

A nondeterministic finite automaton (NFA) is a 5-tuple
$(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a set of states, $\Sigma$ is a set
of input symbols, $\delta$ is a mapping
$Q \times (\Sigma \cup \{\varepsilon\}) \mapsto 2^{Q}$, $q_0$ is an initial
state, and $F \subseteq Q$ is a set of final (accepting) states. A deterministic
finite automaton (DFA) is an NFA where $\delta$ is a mapping
$Q \times \Sigma \mapsto Q$. We can extend $\delta$ to $\hat{\delta}$, mapping
$Q \times \Sigma^* \mapsto 2^{Q}$ for an NFA or $Q \times \Sigma^+ \mapsto Q$
for a DFA, respectively. A DFA (resp. NFA) accepts a string $w \in \Sigma^*$ if
and only if $\hat{\delta}(q_0, w) \in F$ (resp.
$\hat{\delta}(q_0, w) \cap F \neq \emptyset$).

An active state of an NFA, when the last symbol of a prefix $w$ of an input
string has been processed, is each state $q$ with $q \in \hat{\delta}(q_0, w)$.
At the beginning, only $q_0$ is an active state.
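The notion of active states can be illustrated by a minimal sketch of the basic
simulation idea: keep the set of active states and update it for each input
symbol. The names `EPS`, `eps_closure`, `active_states` and `nfa_accepts`, the
dictionary encoding of $\delta$, and the sample automaton are assumptions made
for this illustration; they do not come from the paper. The sketch also treats
states reachable from $q_0$ by $\varepsilon$-transitions as active at the start.

```python
from typing import Dict, Set, Tuple

EPS = ""  # hypothetical marker used here for an epsilon-transition

Delta = Dict[Tuple[int, str], Set[int]]


def eps_closure(states: Set[int], delta: Delta) -> Set[int]:
    """Return all states reachable from `states` by epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure


def active_states(delta: Delta, q0: int, w: str) -> Set[int]:
    """Set of active states after the whole string w has been processed."""
    active = eps_closure({q0}, delta)
    for a in w:
        moved: Set[int] = set()
        for q in active:
            moved |= delta.get((q, a), set())
        active = eps_closure(moved, delta)
    return active


def nfa_accepts(delta: Delta, q0: int, finals: Set[int], w: str) -> bool:
    """The NFA accepts w iff some final state is active at the end."""
    return bool(active_states(delta, q0, w) & finals)


# Sample NFA over {a, b} accepting strings that end with "ab".
SAMPLE: Delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
```

For the sample automaton, `active_states(SAMPLE, 0, "aab")` yields the states 0
and 2, so the string is accepted when state 2 is final.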

The depth of a state $q$ in an NFA is the minimum number of moves needed
to get from the initial state $q_0$ to $q$ without using
$\varepsilon$-transitions. The level of a state $q$ in an NFA is the minimum
among the numbers of differences associated with all final states reachable
from $q$. In the figures of this paper, the states of the same depth are in the
same column and the states of the same level are in the same line (row).

An algorithm $A$ simulates a run of an NFA if, for all $w \in \Sigma^*$,
$A$ with $w$ at the input reports all information associated with each final
state $q_f$, $q_f \in F$, after processing $w$, if and only if
$q_f \in \hat{\delta}(q_0, w)$.

3 Dynamic Programming 

The basic simulation method described in m maintains a set of active states
during the whole simulation process. While in bit parallelism this set is
represented by bit vectors, in dynamic programming it is represented by a
vector of integer variables. We divide all states into subsets and each
subset is represented by one integer. The value of this integer then records
which states of the subset are active.

The simulation using dynamic programming will be shown on the NFA for
approximate string matching using the Levenshtein distance. This problem
is defined as searching for all occurrences of a pattern
$P = p_1 p_2 \ldots p_m$ in a text $T = t_1 t_2 \ldots t_n$, where the found
occurrence $X$ (a substring of $T$) can have at most $k$ differences. The
number of differences is given by the Levenshtein distance $D_L(P, X)$, which
is defined as the minimum number of edit operations replace, insert, and
delete that are needed to convert $P$ to $X$.
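As a point of reference for the definition of $D_L(P, X)$, the following is a
minimal sketch of the textbook row-by-row computation of the Levenshtein
distance between two whole strings. It is not the searching formula of this
paper; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(p: str, x: str) -> int:
    """Minimum number of replace, insert and delete operations turning p into x."""
    prev = list(range(len(x) + 1))  # distance from the empty prefix of p
    for j, pc in enumerate(p, start=1):
        cur = [j]  # deleting the first j symbols of p yields the empty string
        for i, xc in enumerate(x, start=1):
            cur.append(min(
                prev[i - 1] + (pc != xc),  # match (cost 0) or replace (cost 1)
                prev[i] + 1,               # delete pc from p
                cur[i - 1] + 1,            # insert xc
            ))
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3 (two replaces and
one insert).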
Fig. 1. Dynamic programming uses one integer variable d for each depth of
states of the NFA.

Figure 1 shows the NFA constructed for this problem ($m = 4$, $k = 2$).
The horizontal transitions represent matching, the vertical transitions
represent insert, the diagonal $\varepsilon$-transitions represent delete, and
the remaining diagonal transitions represent replace. The self-loop of the
initial state provides skipping of the prefixes of $T$ located in front of the
occurrences. Formula (1) simulates the run of the NFA in Figure 1.

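$$
\begin{aligned}
d_{j,0} &\leftarrow j, && 0 \le j \le m \\
d_{0,i} &\leftarrow 0, && 0 \le i \le n \\
d_{j,i} &\leftarrow \min(\text{if } t_i = p_j \text{ then } d_{j-1,i-1} \text{ else } d_{j-1,i-1} + 1, \\
        &\qquad\quad \text{if } j < m \text{ and } t_i \ne p_{j+1} \text{ then } d_{j,i-1} + 1, && 0 < i \le n, \\
        &\qquad\quad d_{j-1,i} + 1), && 0 < j \le m
\end{aligned}
\tag{1}
$$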

In the dynamic programming for approximate string matching using the
Levenshtein distance there is, for each depth $j$, $0 \le j \le m$, of the NFA
and in each step $i$ of the run, one integer variable $d_{j,i}$ that contains
the level number of the topmost active state in the $j$-th depth of the NFA.
Each value of $d_{j,i}$ greater than $k + 1$ can be replaced by the value
$k + 1$, which represents that there is no active state in the $j$-th depth of
the NFA in the $i$-th step of the run.

Term $d_{j-1,i-1}$ represents the matching transition, term $d_{j-1,i-1} + 1$
represents the transition replace, term $d_{j,i-1} + 1$ represents the
transition insert, and term $d_{j-1,i} + 1$ represents the transition delete.

The self-loop of the initial state is represented by setting
$d_{0,i} \leftarrow 0$, $0 \le i \le n$. Only the states reachable from $q_0$
by $\varepsilon$-transitions are active at the beginning. Thus all transitions
(paths) of the NFA are considered.

Each element $d_{m,i} \le k$ shows an occurrence of $P$ with at most $d_{m,i}$
differences: the final state in level $d_{m,i}$ is active.
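Formula (1) can be sketched in a few lines that keep only the previous and the
current column of the matrix. This is an illustrative reading of the formula
under stated assumptions, not the authors' implementation: the function name
`approx_match`, the column-wise storage, and the returned list of
`(end position, differences)` pairs are choices made here.

```python
from typing import List, Tuple


def approx_match(pattern: str, text: str, k: int) -> List[Tuple[int, int]]:
    """Report (i, d) whenever an occurrence with d <= k differences ends at t_i."""
    m = len(pattern)
    cap = k + 1                                   # k + 1 means "no active state"
    col = [min(j, cap) for j in range(m + 1)]     # d_{j,0} = j
    hits = []
    for i, t in enumerate(text, start=1):
        new = [0] * (m + 1)                       # d_{0,i} = 0 (initial self-loop)
        for j in range(1, m + 1):
            # match (t_i = p_j) or replace
            best = col[j - 1] if t == pattern[j - 1] else col[j - 1] + 1
            # insert, allowed only if j < m and t_i != p_{j+1}
            if j < m and t != pattern[j]:
                best = min(best, col[j] + 1)
            # delete (epsilon-transition within the same step)
            best = min(best, new[j - 1] + 1)
            new[j] = min(best, cap)
        col = new
        if col[m] <= k:
            hits.append((i, col[m]))
    return hits
```

With pattern `"abc"`, text `"abxc"` and `k = 1`, the sketch reports occurrences
ending at positions 2, 3 and 4, each with one difference (a delete, a replace
and an insert, respectively).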

In [2] they compress the matrix $D$, since they use the property that
$(d_{j,i} - d_{j-1,i-1}) \in \{0, 1\}$.
4 Adjusting Dynamic Programming



Now we show how dynamic programming can simulate other NFAs. In
approximate string matching using the Hamming distance, the only allowed edit
operation is replace. By omitting the transitions insert and delete from the
NFA for the Levenshtein distance we get the NFA for the Hamming distance. By
omitting the terms simulating the insert and delete transitions in Formula (1)
we get the formula for simulating the NFA for the Hamming distance. In such a
case we also have to change the initial setting of the vector, since only the
initial state is active at the beginning (there are no
$\varepsilon$-transitions in such an NFA).
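A minimal sketch of the Hamming-distance variant, assuming the same
column-wise storage as an illustration device: only the match/replace term of
Formula (1) is kept, and the initial column marks every depth except 0 as
having no active state. The name `hamming_match` and the output format are
assumptions made here.

```python
from typing import List, Tuple


def hamming_match(pattern: str, text: str, k: int) -> List[Tuple[int, int]]:
    """Report (i, d) whenever an occurrence with d <= k replaces ends at t_i."""
    m = len(pattern)
    cap = k + 1                      # k + 1 means "no active state"
    col = [0] + [cap] * m            # only the initial state is active at start
    hits = []
    for i, t in enumerate(text, start=1):
        new = [0] * (m + 1)          # self-loop of the initial state
        for j in range(1, m + 1):
            # match costs 0, replace costs 1; insert and delete are omitted
            new[j] = min(col[j - 1] + (t != pattern[j - 1]), cap)
        col = new
        if col[m] <= k:
            hits.append((i, col[m]))
    return hits
```

For pattern `"abc"`, text `"abdabc"` and `k = 1` this reports an occurrence with
one replace ending at position 3 and an exact occurrence ending at position 6.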

If we label the replace and insert transitions by $\Sigma$ (i.e., the
transitions are used even if $t_i$ is the matching symbol), we get the
so-called $\Sigma$ version of the NFA [^. This NFA has a slightly simpler
formula, but slightly different behavior. The original dynamic programming
algorithm riMsFI for approximate string matching using the Levenshtein
distance simulates the $\Sigma$ version of the NFA.

If we introduce the transition transpose (two adjacent symbols are exchanged)
into the NFA for the Levenshtein distance, we get the NFA for the generalized
Levenshtein distance [1]. This transition leads from a state $q$ of level $l$
and depth $j$ to an auxiliary state $q'$ and then to the state $q''$ of level
$l + 1$ and depth $j + 2$. It reads two symbols: $p_{j+2}$ and $p_{j+1}$.

To get the formula for approximate string matching using the generalized
Levenshtein distance we have to insert Term (2), simulating the transition
transpose, into Formula (1).

$$
\text{if } i > 1 \text{ and } j > 1 \text{ and } t_{i-1} = p_j \text{ and } t_i = p_{j-1} \text{ then } d_{j-2,i-2} + 1 \tag{2}
$$
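Term (2) needs the column from two steps back. The following sketch adds it to
an illustrative implementation of Formula (1); the name `transpose_match`, the
two-column history and the output format are assumptions made here, not part of
the paper.

```python
from typing import List, Tuple


def transpose_match(pattern: str, text: str, k: int) -> List[Tuple[int, int]]:
    """Formula (1) extended by Term (2); reports (i, d) with d <= k."""
    m = len(pattern)
    cap = k + 1
    prev2 = None                                   # column i-2, unused for i = 1
    prev = [min(j, cap) for j in range(m + 1)]     # column 0: d_{j,0} = j
    hits = []
    for i in range(1, len(text) + 1):
        t = text[i - 1]
        cur = [0] * (m + 1)                        # d_{0,i} = 0
        for j in range(1, m + 1):
            best = prev[j - 1] + (t != pattern[j - 1])      # match / replace
            if j < m and t != pattern[j]:                   # insert
                best = min(best, prev[j] + 1)
            best = min(best, cur[j - 1] + 1)                # delete
            # Term (2): t_{i-1} = p_j and t_i = p_{j-1}
            if i > 1 and j > 1 and text[i - 2] == pattern[j - 1] \
                    and t == pattern[j - 2]:
                best = min(best, prev2[j - 2] + 1)          # transpose
            cur[j] = min(best, cap)
        prev2, prev = prev, cur
        if cur[m] <= k:
            hits.append((i, cur[m]))
    return hits
```

For pattern `"abcd"` and text `"abdc"` with `k = 1`, the occurrence ending at
position 4 is found only because of the transpose term; the occurrence ending
at position 3 corresponds to deleting `c`.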



We get sequence matching from string matching if we allow any number
of symbols to be located between two adjacent symbols of the pattern. We get
the NFA for sequence matching from the NFA for string matching just by
inserting a self-loop into each nonfinal state. Such a self-loop of state $q$
is then labeled by $\bar{p}_j$, where $p_j$ is the label of the match
transition leading from $q$ 2]. To get the formula for sequence matching, we
should insert the term "if $j < m$ and $t_i \ne p_{j+1}$ then $d_{j,i-1}$"
(which simulates the self-loop) into the formula for string matching.
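For the exact case, the self-loop term can be sketched as follows. Each depth
holds 0 (the state is active) or 1 (no active state). The name
`sequence_match`, the restriction to exact matching and the list of reported
end positions are assumptions made for this illustration.

```python
from typing import List


def sequence_match(pattern: str, text: str) -> List[int]:
    """Positions i at which the pattern has just been completed as a sequence."""
    m = len(pattern)
    inactive = 1
    col = [0] + [inactive] * m        # only the initial state is active at start
    hits = []
    for i, t in enumerate(text, start=1):
        new = [0] * (m + 1)           # self-loop of the initial state
        for j in range(1, m + 1):
            # match transition from depth j-1
            by_match = col[j - 1] if t == pattern[j - 1] else inactive
            # self-loop of a nonfinal state: j < m and t_i != p_{j+1}
            by_loop = col[j] if (j < m and t != pattern[j]) else inactive
            new[j] = min(by_match, by_loop)
        col = new
        if col[m] == 0:
            hits.append(i)
    return hits
```

For pattern `"abc"` and text `"axbxc"` the sketch reports position 5; for text
`"acb"` it reports nothing, since the symbols do not occur in pattern order.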

If we do not care about the exact number of differences in the found string
and we only want to be sure that the number of differences is less than or
equal to $k$, we can use the reduced NFA [ 50 . The reduced NFA for
approximate string matching contains in each level only the first
$(m - k + 1)$ states. The other states are removed, since they are needed only
to determine the exact number of differences in the found string. For the
simulation of this NFA, we can use
¹ Originally, it was designed to compute the edit distance directly, which in
the $\Sigma$ version of the NFA corresponds exactly to the values of the
integer vector. Later it was shown that it simulates the corresponding NFA.

² A similar simulation of reduced NFAs ($\Sigma$-version) for approximate
string matching using the Levenshtein distance was independently presented in
jl].
another approach. We have one integer variable for each
$\varepsilon$-diagonal. See [6] for details.

Weighted approximate string and sequence matching introduces for each edit
operation (replace, insert, delete, and transpose) its weight ($w_R$, $w_I$,
$w_D$, and $w_T$, respectively). The weight can be either independent of the
symbol being edited or it can vary according to the symbol. In the NFA this is
represented in such a way that transitions representing various edit
operations lead to states in various levels according to their weights. In the
formulae for simulating the NFA, instead of increasing the value $d_{j,i}$ by
one, we increase it by the weight of the corresponding edit operation. For
details see [6].
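A sketch of the weighted variant, assuming symbol-independent integer weights
and leaving out transpose for brevity: the increments of one in Formula (1) are
replaced by the weights. The name `weighted_match`, its keyword parameters and
the initial column $d_{j,0} = j \cdot w_D$ are assumptions made for this
illustration.

```python
from typing import List, Tuple


def weighted_match(pattern: str, text: str, k: int,
                   w_rep: int = 1, w_ins: int = 1,
                   w_del: int = 1) -> List[Tuple[int, int]]:
    """Report (i, cost) whenever an occurrence of total weight <= k ends at t_i."""
    m = len(pattern)
    cap = k + 1
    col = [min(j * w_del, cap) for j in range(m + 1)]   # j deletions at start
    hits = []
    for i, t in enumerate(text, start=1):
        new = [0] * (m + 1)
        for j in range(1, m + 1):
            best = col[j - 1] if t == pattern[j - 1] else col[j - 1] + w_rep
            if j < m and t != pattern[j]:
                best = min(best, col[j] + w_ins)         # weighted insert
            best = min(best, new[j - 1] + w_del)         # weighted delete
            new[j] = min(best, cap)
        col = new
        if col[m] <= k:
            hits.append((i, col[m]))
    return hits
```

With unit weights the result for pattern `"abc"`, text `"abxc"`, `k = 1` contains
the occurrence ending at position 4 (one insert); with `w_ins=2` that occurrence
exceeds the limit and is no longer reported.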

5 Conclusion 

In this paper we presented the simulation method called dynamic programming.
The method was demonstrated on the NFA for approximate string matching using
the Levenshtein distance. We also showed how to use the method for other
problems.

The time complexity of dynamic programming for the presented tasks is
$O(mn)$ ($O(m|\Sigma| + (m - k)n)$ for the reduced NFA) and the space
complexity is $O(m)$ ($O(m|\Sigma| + m - k)$ for the reduced NFA).

Together with the other simulation methods, dynamic programming offers
an alternative to the DFA when deciding how to use an NFA.
Abstract. A technique for parsing cyclic strings using an adapted
strong LR parser is described. The adapted parser is able to find the
starting point of the normal form of a cyclic string. The resulting parsing
algorithm has linear time complexity.



1 Introduction 

One technique used for the recognition of two-dimensional shapes consists in 
a description of the contour of the shape by a linear string of primitives. A num- 
ber of approaches exists to obtain such a representation [FY86]. 

The linear string representing a planar object must be invariant with respect
to the starting point. For instance, if we have a set of planar objects and the
primitives used are the symbols a and b, the strings ababaab, babaaba and
abaabab represent the same object. It is useful to write all strings
representing some class of objects in a normal form. Then it is possible to
find a formal description of the set of strings written in normal form in a
given class. One useful formal system for the description of sets of strings
is a grammar. Given a grammar, it is possible to use a parser to decide whether
a given string belongs to the language described by the grammar, provided that
the string is in normal form. Such a decision therefore requires putting the
strings in normal form first. On the other hand, it is very useful to adapt the
parser so that it can determine whether there exists a cyclic rotation of the
string that is a member of the language described by the grammar. Oncina [ON96]
has described an adaptation of the Cocke-Younger-Kasami context-free grammar
parser for use with cyclic strings. The resulting parser has the same time
complexity as the original one ($O(n^3)$).

In this paper, an adaptation of the strong LR parser for use with cyclic
strings is described. The adapted parser has linear time complexity.

* This research was partially supported by grants 201/02/0125 of the Grant Agency of 
Czech Republic and by the Ministry of Education, Youth, and Sports of the Czech 
Republic under research program No. 404/98:212300014 (Research in the area of 
information technologies and communications). 
A (right) cyclic shift is a mapping $\sigma : T^* \to T^*$, defined by
$\sigma(a_1 \ldots a_n) = a_n a_1 \ldots a_{n-1}$. Let $\sigma^m$ denote the
$m$-fold composition of $\sigma$ for all $m > 0$ and let $\sigma^0$ denote the
identity. Two strings $x$ and $y$ in $T^*$ are equivalent if
$x = \sigma^m(y)$ for some $m \ge 0$, where $m$ is the length of the cyclic
shift. Clearly, this defines an equivalence relation in $T^*$. The equivalence
class of a string $x$ will be denoted by $[x]$, and will be called a cyclic
string. Let $A \subseteq T^*$, then $[A] = \bigcup_{x \in A} [x]$.
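The definitions above translate directly into a short sketch. The function
names `sigma`, `shift`, `cyclic_class` and `equivalent` are chosen here for
illustration and are not part of the paper.

```python
from typing import Set


def sigma(x: str) -> str:
    """Right cyclic shift: a_1 ... a_n  ->  a_n a_1 ... a_{n-1}."""
    return x[-1:] + x[:-1]


def shift(x: str, m: int) -> str:
    """sigma applied m times; m = 0 is the identity."""
    for _ in range(m):
        x = sigma(x)
    return x


def cyclic_class(x: str) -> Set[str]:
    """The equivalence class [x], i.e. the cyclic string of x."""
    return {shift(x, m) for m in range(max(len(x), 1))}


def equivalent(x: str, y: str) -> bool:
    """x and y are equivalent iff x = sigma^m(y) for some m >= 0."""
    return len(x) == len(y) and x in cyclic_class(y)
```

With these definitions, `equivalent("ababaab", "babaaba")` and
`equivalent("ababaab", "abaabab")` both hold, which matches the example of three
strings representing the same object.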

2 Strong LR Parsing 

We use standard notation as defined in [AV71]. Let $G = (N, T, P, S)$ be a
context-free grammar. The augmented grammar for $G$ is
$G' = (N \cup \{S'\}, T, P \cup \{S' \to S\}, S')$, where $S' \notin N$. All
derivations used in this paper are rightmost derivations. Let $k \ge 0$. A
grammar $G$ is said to be strong LR($k$) if, in the augmented grammar $G'$ for
$G$, the conditions

$S' \Rightarrow^* \alpha_1 A w \Rightarrow \alpha_1 \beta w = \alpha_1' X_1 w$,

$S' \Rightarrow^* \gamma B x \Rightarrow \alpha_2 \beta y = \alpha_2' X_2 y$,

$X_1 = X_2$ and $FIRST_k(w) = FIRST_k(y)$

always imply that $A = B$ and $x = y$.

Strong LR parsing is described in [MEL88]. The construction of the
parsing table is based on the collection L of sets of strong LR(0) items. The
construction of this collection is similar to the construction of the canonical
collection of sets of LR(0) items.

The symbol to the left of the dot in an LR(0) item is used as the name of
the set of strong LR(0) items in the collection L. This symbol is unique for
each item in the collection, with the only exception of the initial set. Its
name is # in all cases. The difference between the canonical collection and the
collection L is:

If two sets with the same name X are in L, then replace both sets by their
union. So the collection L contains the initial set and just one set for
each grammar symbol. The goto table is not necessary and only the parsing
table is used. It is constructed as usual for simple LR grammars. The parser
for strong LR grammars does not have the correct prefix property, because some
sets are merged during the construction of the collection L. The standard LR
parser must be modified in order to obtain the strong LR parser in this way:
during the reduce operation it must check whether the right-hand side of the
reduction rule is at the top of the pushdown store. If not, the parsing ends
with an error signalization.
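The merging of item sets by name can be sketched as follows, using the small
grammar from the Example section (rules $S' \to S$, $S \to cA$, $A \to aAb$,
$A \to d$). The sketch builds the canonical LR(0) sets and unites all sets
reached over the same symbol. The item encoding as `(rule index, dot position)`
and all function names are assumptions made for this illustration; parsing
table construction is not covered.

```python
from typing import Dict, FrozenSet, Set, Tuple

Item = Tuple[int, int]  # (rule index, dot position)

GRAMMAR = [("S'", ("S",)), ("S", ("c", "A")),
           ("A", ("a", "A", "b")), ("A", ("d",))]
NONTERMINALS = {"S'", "S", "A"}


def closure(items: Set[Item]) -> FrozenSet[Item]:
    """Standard LR(0) closure of a set of items."""
    result = set(items)
    work = list(items)
    while work:
        rule, dot = work.pop()
        rhs = GRAMMAR[rule][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for r, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (r, 0) not in result:
                    result.add((r, 0))
                    work.append((r, 0))
    return frozenset(result)


def goto(items: FrozenSet[Item], symbol: str) -> FrozenSet[Item]:
    """Standard LR(0) goto: move the dot over `symbol`."""
    return closure({(r, d + 1) for r, d in items
                    if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == symbol})


def strong_lr0_collection() -> Dict[str, Set[Item]]:
    """Sets of items named by the symbol left of the dot; same names are united."""
    start = closure({(0, 0)})
    named: Dict[str, Set[Item]] = {"#": set(start)}   # initial set is named #
    seen = {start}
    work = [start]
    while work:
        current = work.pop()
        symbols = {GRAMMAR[r][1][d] for r, d in current
                   if d < len(GRAMMAR[r][1])}
        for x in symbols:
            target = goto(current, x)
            named.setdefault(x, set()).update(target)  # union of same-named sets
            if target not in seen:
                seen.add(target)
                work.append(target)
    return named
```

For this grammar the result has seven sets (one for # and one per grammar
symbol), and the set named `A` unites the items $S \to cA.$ and $A \to aA.b$,
which come from two different canonical sets.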

3 Parsing of Cyclic Strings 

The parsing of cyclic strings in the normal form starts with the initial symbol 
(#) on the pushdown store and the first k symbols of the input string as the 
lookahead string. The parsing of a cyclic string can start in an arbitrary point 
of the input string. Let us use two observations: 

1. We can divide a sequence of configurations of the parser parsing some
input string into a number of subsequences, each one starting in a
configuration with the terminal symbol on the top of the pushdown store (with
an exception of the first subsequence). Each such subsequence consists of some
number of transitions corresponding to reductions followed by a transition
corresponding to the shift of one terminal symbol (with an exception of the
last subsequence).

2. For each terminal symbol on the top of the pushdown store and for each
lookahead string there is at most one non-error entry in the strong LR parsing
table.

Using these observations, we can start parsing the input string at an arbitrary
point, provided that a substring of $k + 1$ terminal symbols is seen by the
parser. The first symbol of this substring is the symbol on the top of the
pushdown store and the rest is the lookahead string of length $k$.

A configuration of the cyclic parser is a quadruple
$(\#\omega, x, \pi, \alpha)$, where $\#\omega$ is the contents of the pushdown
store, $x$ is the rest of the input string, $\pi$ is the concatenation of the
substring of the right parse and the prefix of it, and $\alpha$ is the expected
string.

The expected string $\alpha$ is created during reductions. Let the cyclic
string with normal form $x = a_1 a_2 \cdots a_n$ be
$x^m = a_{n-m+1} \cdots a_n a_1 a_2 \cdots a_{n-m}$, where $m$ is the length of
the cyclic shift. It is composed of two parts. The first part
$a_{n-m+1} \cdots a_n$ is a suffix of $x$ and it is parsed starting with symbol
$a_{n-m+1}$. It is possible that during the parsing of this suffix a reduction
needs some string of symbols which is expected on the top of the pushdown store
but which was not pushed onto it. This is a string which can be obtained during
parsing of the prefix of $x$, which will be parsed later. Therefore the parser
creates the expected string in order to check the correctness of the cyclic
string and to compute the length of the cyclic shift.

To continue parsing after the suffix $a_{n-m+1} a_{n-m+2} \ldots a_n$ has been
parsed, we must ensure that parsing with symbol $a_i$,
$i = n - k + 1, n - k + 2, \ldots, n$, at the top of the pushdown store and
$FIRST_k(a_{i+1} \ldots a_n a_1 a_2 \cdots a_m)$ as the lookahead string is
possible. The starting point of classical LR parsing is the configuration
$(\#, x_1 x_2 \cdots x_n, \varepsilon)$. Therefore we must simulate this
situation during the parsing of the cyclic string in the configuration with the
remaining input
$a_{i+1} \ldots a_n a_1 a_2 \ldots a_{n-m}.a_{n-m+1} \cdots a_{n-m+k}$ and the
parse $\pi$. For this reason we must modify the parsing table.

We restrict our further explanation to the case k = 1. 

The modification of the parsing table consists of two moving operations: 

1. The moving of reduction operations which are performed at the end of
the standard parsing (in the column for $\varepsilon$) to the columns for the
symbols in FIRST(S) in the same row.

2. The moving of operations which are performed at the beginning of the
standard parsing (operations in the row for the initial symbol #) to the row
containing the operation accept.

The target entry of these moving operations can contain three possible values: 

1. It is an error entry. 

2. It contains the same operation as the moved operation.

3. It contains an operation (shift or reduce) different from the moved
operation.
The moving operation is straightforward in the first two cases; in the second
case it is even an empty operation. In the third case there is a conflict and
the moving is not possible. We obtain the modified parsing table MC from the
parsing table M by performing the moving operations and removing the row
of M for the initial symbol #. If there is some conflict, then the cyclic
parsing is not deterministic. Otherwise the cyclic strong LR parsing is
deterministic using an adapted strong LR parser. The adaptation of a strong LR
parser to obtain a cyclic parser consists of the following changes:

1. Variable C is introduced as a counter of positions in the input string. 

2. The starting and ending positions of symbols pushed to the pushdown store
(start and end, respectively) are added. Let $A \to \beta$,
$\beta = B\beta'D$, be a rule used for the reduction of string $\beta$ to a
nonterminal $A$, where $B, D \in (N \cup T)$; then start(A) = start(B) and
end(A) = end(D). The starting and ending positions of a terminal symbol are
the same and they are equal to the position of this symbol in the input
string. If the starting position is not defined for some nonterminal symbol
$X$, then start(X) = -1.

3. The positional information is added also to the elements of the right parse. 
The ending position of the nonterminal symbol which is the result of some 
reduction, is added to the corresponding element of the right parse. 

4. An input of the parser is an input string with its first symbol appended at 
the end as a lookahead string. 

5. The first operation of the parser is shift. This makes it possible to start
the parsing with a terminal symbol on the top of the pushdown store.

6. A reconstruction of the configuration of the parser is performed after parsing 
of the input string. The result of this reconstruction is a configuration of the 
parser parsing the input string in the normal form just after the shift of the 
last symbol. 

The following algorithm is the strong LR(1) parser adapted for parsing cyclic
strings.

Algorithm

Cyclic strong LR(1) parsing.

Input: Cyclic parsing table MC for grammar $G = (N, T, P, S)$, cyclic string $x$.
Output: Right parse for $x$ in the normal form or an error signalization.

Method: String $y$ is the lookahead string. Symbol $X$ will be the top symbol
of the pushdown store. String $x' = x.FIRST(x)$ will be the input string. The
dot (.) is used to separate the cyclic string from the appended string.

1. Set $\alpha = \varepsilon$, $\pi = \varepsilon$, $C = 1$.

2. Read the first input symbol and push it on the pushdown store with its
position equal to one.

3. Repeat the next steps until the input is parsed or an error signalization
appears:

a) String $y$ is the lookahead string (the first input symbol from the rest of
the input string). The appended string (FIRST(x)) is used only when the
expected string $\alpha$ is not empty. Otherwise it is supposed to be
empty and the cyclic shift is equal to zero.

b) If MC(X, y) = shift, read the input symbol $z = FIRST(y)$, set
$C := C + 1$ and push $z(C, C)$ to the pushdown store.
c) If MC(X, y) = reduce(i), where $A \to \beta$, $\beta = B\beta'D$, is the
i-th rule in $P$ and $B, D \in N \cup T$, then:

i. If string $\beta$ is on the top of the pushdown store, then pop it and push
symbol $A$ on the top of the pushdown store with start(A) = start(B)
and end(A) = end(D).

ii. If only the suffix $\beta_2$ of $\beta = \beta_1 \beta_2$ is on the top of
the pushdown store, then set $\alpha := \beta_1 \alpha$, pop $\beta_2$ from the
pushdown store, and push symbol $A$ on the top of the pushdown store with
start(A) = -1 and end(A) = end(D).

In both cases, add $R_i(C)$ to the right parse: $\pi := \pi R_i(C)$, where
$C$ = end(A).

d) If MC(X, y) = error, then finish with an error signalization.

4. If string $\alpha$ is a suffix of $\omega$ in the current configuration,
i.e. $\omega = \gamma\alpha$, then the length of the cyclic shift is equal to
$m$, which is the ending length of the input string minus the position of the
rightmost symbol of $\gamma$; otherwise finish with an error signalization.
Divide $\pi$ into two parts $\pi = \pi_1 \pi_2$ so that the first symbol of
$\pi_2$ corresponds to the reduction made in some position which is greater
than the length of the cyclic shift. The next configuration will be

$(\#\gamma, \varepsilon, \pi_2 \pi_1, \varepsilon)$.

5. Do all possible reductions:

If $M(X, \varepsilon)$ = reduce(i), where $A \to \beta$, $\beta = B\beta'D$, is
the i-th rule in $P$, then pop string $\beta$ and push symbol $A$ on the top of
the pushdown store with start(A) = start(B) and end(A) = end(D). Add $R_i(C)$
to the right parse: $\pi := \pi R_i(C)$.

6. If $M(X, \varepsilon)$ = accept, then finish; $\pi$ is the right parse of the
input string in the normal form. Symbol $S(p, m)$ is in the pushdown store. The
cyclic shift of the input string is its length minus $m$.

4 Example 

Consider the augmented strong LR(1) grammar
$G = (\{S', S, A\}, \{a, b, c, d\}, P, S')$, where $P$ consists of the rules:

(0) $S' \to S$      (2) $A \to aAb$

(1) $S \to cA$      (3) $A \to d$

This grammar generates the language $L(G) = \{c a^n d b^n : n \ge 0\}$.

The collection of sets of strong LR(0) items for $G$ contains the following
sets:

$\# = \{S' \to .S,\ S \to .cA\}$
$S = \{S' \to S.\}$
$c = \{S \to c.A,\ A \to .aAb,\ A \to .d\}$
$A = \{S \to cA.,\ A \to aA.b\}$
$a = \{A \to a.Ab,\ A \to .aAb,\ A \to .d\}$
$b = \{A \to aAb.\}$
$d = \{A \to d.\}$


The parsing of string adbbca using table MC will be performed in the
following way:

$(\#,\ adbbca.a,\ \varepsilon,\ \varepsilon)$
$\vdash (\#a(1,1),\ dbbca.a,\ \varepsilon,\ \varepsilon)$
$\vdash (\#a(1,1)d(2,2),\ bbca.a,\ \varepsilon,\ \varepsilon)$
$\vdash (\#a(1,1)A(2,2),\ bbca.a,\ R_3(2),\ \varepsilon)$
$\vdash (\#a(1,1)A(2,2)b(3,3),\ bca.a,\ R_3(2),\ \varepsilon)$
$\vdash (\#A(1,3),\ bca.a,\ R_3(2)R_2(3),\ \varepsilon)$
$\vdash (\#A(1,3)b(4,4),\ ca.a,\ R_3(2)R_2(3),\ \varepsilon)$
$\vdash (\#A(-1,4),\ ca.a,\ R_3(2)R_2(3)R_2(4),\ a)$
$\vdash (\#S(-1,4),\ ca.a,\ R_3(2)R_2(3)R_2(4)R_1(4),\ ca)$
$\vdash (\#S(-1,4)c(5,5),\ a.a,\ R_3(2)R_2(3)R_2(4)R_1(4),\ ca)$
$\vdash (\#S(-1,4)c(5,5)a(6,6),\ .a,\ R_3(2)R_2(3)R_2(4)R_1(4),\ ca)$

Now the parsing of the input string stops and, after the reconstruction of the
configuration, the parsing continues:

$\vdash (\#S(-1,4),\ \varepsilon,\ R_3(2)R_2(3)R_2(4)R_1(4),\ \varepsilon)$

The parsing finishes by operation accept and the cyclic shift is 2. 
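The result of this example can be cross-checked with a brute-force baseline
that tries every rotation of the input. This is not the linear-time cyclic
parser described above; it is a quadratic check written only for the language
$\{c a^n d b^n : n \ge 0\}$ of the example grammar. The names `in_language` and
`find_cyclic_shift` are assumptions, and the shift is taken here as the number
of trailing input symbols that must be moved to the front to obtain the normal
form.

```python
import re
from typing import Optional, Tuple

# Shape of a string in normal form: c a^n d b^n (the counts are checked below).
NORMAL = re.compile(r"c(a*)d(b*)")


def in_language(s: str) -> bool:
    """True iff s = c a^n d b^n for some n >= 0."""
    found = NORMAL.fullmatch(s)
    return found is not None and len(found.group(1)) == len(found.group(2))


def find_cyclic_shift(s: str) -> Optional[Tuple[int, str]]:
    """Return (shift, normal form) for the first rotation in the language."""
    n = len(s)
    for m in range(n):
        rotated = s[n - m:] + s[:n - m]   # move the last m symbols to the front
        if in_language(rotated):
            return m, rotated
    return None
```

For the input `adbbca` the baseline returns the shift 2 and the normal form
`caadbb`, in agreement with the trace above.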

The parsing table M and the modified parsing table MC are: 
[Tables M and MC: entries not recoverable from the source text.]



In these tables, the following abbreviations are used: A for accept, s for
shift, and $R_i$ for reduction by rule (i); error entries are empty.

5 Conclusion 



We have shown how deterministic parsing of cyclic strings can be done.
Strong LR parsing is the method used as a base. The time complexity of cyclic
string parsing is linear ($O(n)$). An open question is the extension of the
presented approach in two ways. The first is the extension of the method to
use more than one lookback symbol. The second is the adaptation of a top-down
parser for parsing cyclic strings.